linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/6] ext4: Add atomic write support for DIO
@ 2024-10-25  3:45 Ritesh Harjani (IBM)
  2024-10-25  3:45 ` [PATCH 1/6] ext4: Add statx support for atomic writes Ritesh Harjani (IBM)
                   ` (5 more replies)
  0 siblings, 6 replies; 39+ messages in thread
From: Ritesh Harjani (IBM) @ 2024-10-25  3:45 UTC (permalink / raw)
  To: linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	John Garry, Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel, Ritesh Harjani (IBM)

This series adds the base feature support to enable atomic writes in
direct-io path for ext4. We advertise the minimum and the maximum atomic
write unit sizes via statx on a regular file.

This series allows users to utilize atomic write support using -
1. on bs < ps systems via - mkfs.ext4 -F -b 16384 /dev/sda
2. on bs = ps systems via bigalloc - mkfs.ext4 -O bigalloc -F -b 4096 -C 65536 /dev/sda

This can then be utilized using -
	xfs_io -fdc "pwrite -V 1 -A -b16k 0 16k" /mnt/f1

This is built on top of John's DIO atomic write series for XFS [1].
The VFS and block layer enablement for atomic writes were merged already.

[1]: https://lore.kernel.org/linux-xfs/20241019125113.369994-1-john.g.garry@oracle.com/

RFC -> PATCH:
- Dropped RFC tag
- Last RFC was posted a while ago but back then a lot of VFS and block layer
  interfaces were still not merged. Those are now merged, thanks to John and
  everyone else.
- [RFC] - https://lore.kernel.org/linux-ext4/cover.1709356594.git.ritesh.list@gmail.com/

Ritesh Harjani (IBM) (6):
  ext4: Add statx support for atomic writes
  ext4: Check for atomic writes support in write iter
  ext4: Support setting FMODE_CAN_ATOMIC_WRITE
  ext4: Warn if we ever fallback to buffered-io for DIO atomic writes
  iomap: Lift blocksize restriction on atomic writes
  ext4: Add atomic write support for bigalloc

 fs/ext4/ext4.h       | 14 +++++++++++++-
 fs/ext4/file.c       | 27 ++++++++++++++++++++++++++-
 fs/ext4/inode.c      | 27 +++++++++++++++++++++++++++
 fs/ext4/super.c      | 37 +++++++++++++++++++++++++++++++++++++
 fs/iomap/direct-io.c |  2 +-
 5 files changed, 104 insertions(+), 3 deletions(-)

--
2.46.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/6] ext4: Add statx support for atomic writes
  2024-10-25  3:45 [PATCH 0/6] ext4: Add atomic write support for DIO Ritesh Harjani (IBM)
@ 2024-10-25  3:45 ` Ritesh Harjani (IBM)
  2024-10-25  9:41   ` John Garry
  2024-10-25  3:45 ` [PATCH 2/6] ext4: Check for atomic writes support in write iter Ritesh Harjani (IBM)
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 39+ messages in thread
From: Ritesh Harjani (IBM) @ 2024-10-25  3:45 UTC (permalink / raw)
  To: linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	John Garry, Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel, Ritesh Harjani (IBM)

This patch adds base support for atomic writes via statx getattr.
On bs < ps systems, we can create FS with say bs of 16k. That means
both atomic write min and max unit can be set to 16k for supporting
atomic writes.

Later patches adds support for bigalloc as well so that ext4 can also
support doing atomic writes for bs = ps systems.

Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext4/ext4.h  |  7 ++++++-
 fs/ext4/inode.c | 14 ++++++++++++++
 fs/ext4/super.c | 32 ++++++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 44b0d418143c..a41e56c2c628 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1729,6 +1729,10 @@ struct ext4_sb_info {
 	 */
 	struct work_struct s_sb_upd_work;
 
+	/* Atomic write unit values */
+	unsigned int fs_awu_min;
+	unsigned int fs_awu_max;
+
 	/* Ext4 fast commit sub transaction ID */
 	atomic_t s_fc_subtid;
 
@@ -1820,7 +1824,8 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
  */
 enum {
 	EXT4_MF_MNTDIR_SAMPLED,
-	EXT4_MF_FC_INELIGIBLE	/* Fast commit ineligible */
+	EXT4_MF_FC_INELIGIBLE,	/* Fast commit ineligible */
+	EXT4_MF_ATOMIC_WRITE	/* Supports atomic write */
 };
 
 static inline void ext4_set_mount_flag(struct super_block *sb, int bit)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 54bdd4884fe6..897c028d5bc9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5578,6 +5578,20 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path,
 		}
 	}
 
+	if (S_ISREG(inode->i_mode) && (request_mask & STATX_WRITE_ATOMIC)) {
+		struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+		unsigned int awu_min, awu_max;
+
+		if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_ATOMIC_WRITE)) {
+			awu_min = sbi->fs_awu_min;
+			awu_max = sbi->fs_awu_max;
+		} else {
+			awu_min = awu_max = 0;
+		}
+
+		generic_fill_statx_atomic_writes(stat, awu_min, awu_max);
+	}
+
 	flags = ei->i_flags & EXT4_FL_USER_VISIBLE;
 	if (flags & EXT4_APPEND_FL)
 		stat->attributes |= STATX_ATTR_APPEND;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 16a4ce704460..f5c075aff060 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4425,6 +4425,37 @@ static int ext4_handle_clustersize(struct super_block *sb)
 	return 0;
 }
 
+/*
+ * ext4_atomic_write_init: Initializes filesystem min & max atomic write units.
+ * @sb: super block
+ * TODO: Later add support for bigalloc
+ */
+static void ext4_atomic_write_init(struct super_block *sb)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct block_device *bdev = sb->s_bdev;
+
+	if (!bdev_can_atomic_write(bdev))
+		return;
+
+	if (!ext4_has_feature_extents(sb))
+		return;
+
+	sbi->fs_awu_min = max(sb->s_blocksize,
+			      bdev_atomic_write_unit_min_bytes(bdev));
+	sbi->fs_awu_max = min(sb->s_blocksize,
+			      bdev_atomic_write_unit_max_bytes(bdev));
+	if (sbi->fs_awu_min && sbi->fs_awu_max &&
+			sbi->fs_awu_min <= sbi->fs_awu_max) {
+		ext4_set_mount_flag(sb, EXT4_MF_ATOMIC_WRITE);
+		ext4_msg(sb, KERN_NOTICE, "Supports atomic writes awu_min: %u, awu_max: %u",
+			 sbi->fs_awu_min, sbi->fs_awu_max);
+	} else {
+		sbi->fs_awu_min = 0;
+		sbi->fs_awu_max = 0;
+	}
+}
+
 static void ext4_fast_commit_init(struct super_block *sb)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -5336,6 +5367,7 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 
 	spin_lock_init(&sbi->s_bdev_wb_lock);
 
+	ext4_atomic_write_init(sb);
 	ext4_fast_commit_init(sb);
 
 	sb->s_root = NULL;
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 2/6] ext4: Check for atomic writes support in write iter
  2024-10-25  3:45 [PATCH 0/6] ext4: Add atomic write support for DIO Ritesh Harjani (IBM)
  2024-10-25  3:45 ` [PATCH 1/6] ext4: Add statx support for atomic writes Ritesh Harjani (IBM)
@ 2024-10-25  3:45 ` Ritesh Harjani (IBM)
  2024-10-25  9:44   ` John Garry
  2024-10-25  3:45 ` [PATCH 3/6] ext4: Support setting FMODE_CAN_ATOMIC_WRITE Ritesh Harjani (IBM)
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 39+ messages in thread
From: Ritesh Harjani (IBM) @ 2024-10-25  3:45 UTC (permalink / raw)
  To: linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	John Garry, Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel, Ritesh Harjani (IBM)

Let's validate using generic_atomic_write_valid() in
ext4_file_write_iter() if the write request has IOCB_ATOMIC set.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext4/file.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index f14aed14b9cf..b06c5d34bbd2 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -692,6 +692,20 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (IS_DAX(inode))
 		return ext4_dax_write_iter(iocb, from);
 #endif
+
+	if (iocb->ki_flags & IOCB_ATOMIC) {
+		size_t len = iov_iter_count(from);
+		int ret;
+
+		if (!IS_ALIGNED(len, EXT4_SB(inode->i_sb)->fs_awu_min) ||
+			len > EXT4_SB(inode->i_sb)->fs_awu_max)
+			return -EINVAL;
+
+		ret = generic_atomic_write_valid(iocb, from);
+		if (ret)
+			return ret;
+	}
+
 	if (iocb->ki_flags & IOCB_DIRECT)
 		return ext4_dio_write_iter(iocb, from);
 	else
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 3/6] ext4: Support setting FMODE_CAN_ATOMIC_WRITE
  2024-10-25  3:45 [PATCH 0/6] ext4: Add atomic write support for DIO Ritesh Harjani (IBM)
  2024-10-25  3:45 ` [PATCH 1/6] ext4: Add statx support for atomic writes Ritesh Harjani (IBM)
  2024-10-25  3:45 ` [PATCH 2/6] ext4: Check for atomic writes support in write iter Ritesh Harjani (IBM)
@ 2024-10-25  3:45 ` Ritesh Harjani (IBM)
  2024-10-25  3:45 ` [PATCH 4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes Ritesh Harjani (IBM)
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 39+ messages in thread
From: Ritesh Harjani (IBM) @ 2024-10-25  3:45 UTC (permalink / raw)
  To: linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	John Garry, Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel, Ritesh Harjani (IBM)

FS needs to add the fmode capability in order to support atomic writes
during file open (refer kiocb_set_rw_flags()). Let's check if inode can
support atomic writes then enable FMODE_CAN_ATOMIC_WRITE.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext4/ext4.h | 7 +++++++
 fs/ext4/file.c | 3 +++
 2 files changed, 10 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index a41e56c2c628..78475ff14aa2 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3863,6 +3863,13 @@ static inline int ext4_buffer_uptodate(struct buffer_head *bh)
 extern int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 				  loff_t pos, unsigned len,
 				  get_block_t *get_block);
+
+static inline bool ext4_inode_can_atomicwrite(struct inode *inode)
+{
+	return  ext4_test_mount_flag(inode->i_sb, EXT4_MF_ATOMIC_WRITE) &&
+			S_ISREG(inode->i_mode);
+}
+
 #endif	/* __KERNEL__ */
 
 #define EFSBADCRC	EBADMSG		/* Bad CRC detected */
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b06c5d34bbd2..f9516121a036 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -898,6 +898,9 @@ static int ext4_file_open(struct inode *inode, struct file *filp)
 			return ret;
 	}
 
+	if (ext4_inode_can_atomicwrite(inode))
+		filp->f_mode |= FMODE_CAN_ATOMIC_WRITE;
+
 	filp->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
 	return dquot_file_open(inode, filp);
 }
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes
  2024-10-25  3:45 [PATCH 0/6] ext4: Add atomic write support for DIO Ritesh Harjani (IBM)
                   ` (2 preceding siblings ...)
  2024-10-25  3:45 ` [PATCH 3/6] ext4: Support setting FMODE_CAN_ATOMIC_WRITE Ritesh Harjani (IBM)
@ 2024-10-25  3:45 ` Ritesh Harjani (IBM)
  2024-10-25 16:16   ` Darrick J. Wong
  2024-10-27 22:26   ` Dave Chinner
  2024-10-25  3:45 ` [PATCH 5/6] iomap: Lift blocksize restriction on " Ritesh Harjani (IBM)
  2024-10-25  3:45 ` [PATCH 6/6] ext4: Add atomic write support for bigalloc Ritesh Harjani (IBM)
  5 siblings, 2 replies; 39+ messages in thread
From: Ritesh Harjani (IBM) @ 2024-10-25  3:45 UTC (permalink / raw)
  To: linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	John Garry, Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel, Ritesh Harjani (IBM)

iomap will not return -ENOTBLK in case of dio atomic writes. But let's
also add a WARN_ON_ONCE and return -EIO as a safety net.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext4/file.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index f9516121a036..af6ebd0ac0d6 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -576,8 +576,16 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		iomap_ops = &ext4_iomap_overwrite_ops;
 	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
 			   dio_flags, NULL, 0);
-	if (ret == -ENOTBLK)
+	if (ret == -ENOTBLK) {
 		ret = 0;
+		/*
+		 * iomap will never return -ENOTBLK if write fails for atomic
+		 * write. But let's just add a safety net.
+		 */
+		if (WARN_ON_ONCE(iocb->ki_flags & IOCB_ATOMIC))
+			ret = -EIO;
+	}
+
 	if (extend) {
 		/*
 		 * We always perform extending DIO write synchronously so by
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes
  2024-10-25  3:45 [PATCH 0/6] ext4: Add atomic write support for DIO Ritesh Harjani (IBM)
                   ` (3 preceding siblings ...)
  2024-10-25  3:45 ` [PATCH 4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes Ritesh Harjani (IBM)
@ 2024-10-25  3:45 ` Ritesh Harjani (IBM)
  2024-10-25  8:52   ` John Garry
  2024-10-25  3:45 ` [PATCH 6/6] ext4: Add atomic write support for bigalloc Ritesh Harjani (IBM)
  5 siblings, 1 reply; 39+ messages in thread
From: Ritesh Harjani (IBM) @ 2024-10-25  3:45 UTC (permalink / raw)
  To: linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	John Garry, Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel, Ritesh Harjani (IBM)

Filesystems like ext4 can submit writes in multiples of blocksizes.
But we still can't allow the writes to be split. Hence let's check if
the iomap_length() is same as iter->len or not.

This shouldn't affect XFS since it anyways checks for this in
xfs_file_write_iter() to not support atomic write size request of more
than FS blocksize.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/direct-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index ed4764e3b8f0..1d33b4239b3e 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -306,7 +306,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	size_t copied = 0;
 	size_t orig_count;
 
-	if (atomic && length != fs_block_size)
+	if (atomic && length != iter->len)
 		return -EINVAL;
 
 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 6/6] ext4: Add atomic write support for bigalloc
  2024-10-25  3:45 [PATCH 0/6] ext4: Add atomic write support for DIO Ritesh Harjani (IBM)
                   ` (4 preceding siblings ...)
  2024-10-25  3:45 ` [PATCH 5/6] iomap: Lift blocksize restriction on " Ritesh Harjani (IBM)
@ 2024-10-25  3:45 ` Ritesh Harjani (IBM)
  5 siblings, 0 replies; 39+ messages in thread
From: Ritesh Harjani (IBM) @ 2024-10-25  3:45 UTC (permalink / raw)
  To: linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	John Garry, Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel, Ritesh Harjani (IBM)

EXT4 supports bigalloc feature which allows the FS to work in size of
clusters (group of blocks) rather than individual blocks. This patch
adds atomic write support for bigalloc so that systems with bs = ps can
create FS using -
    mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>

EXT4 can then adjust it's atomic write unit max value to cluster size.
This can then support atomic write of size anywhere between
[blocksize, clustersize].

Note: bigalloc can support writes of the pattern [0 16k] followed by [0 8k].
However, if there is a write pattern detected of type [0 8k] followed by
[0 16k], then we return an error (-EINVAL). It is ok to return an error here
to avoid splitting of atomic write request. This is ok because anyways
atomic write requests has many constraints to follow for e.g. writes of
form which does not follow natural alignment [4k, 12k] ([start, end]) can
also return -EINVAL (check generic_atomic_write_valid()).
Hence the current patch adds the base support needed to support
atomic writes with bigalloc. This is helpful for systems with 4k
pagesize to support atomic writes.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext4/inode.c | 13 +++++++++++++
 fs/ext4/super.c |  9 +++++++--
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 897c028d5bc9..2dee8514d2f8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3423,6 +3423,19 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	 */
 	map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);
 
+	/*
+	 * [0 16k] followed by [0 8k] can work with bigalloc. However,
+	 * For now we don't support atomic writes of the pattern
+	 * [0 8k] followed by [0 16k] in case of bigalloc. This is because it
+	 * can cause the atomic writes to split in the iomap layer.
+	 * Atomic writes anyways has many constraints, so as a base support to
+	 * enable atomic writes using bigalloc, it is ok to return an error for
+	 * an unsupported write request.
+	 */
+	if (flags & IOMAP_ATOMIC) {
+		if (map.m_len < (length >> blkbits))
+			return -EINVAL;
+	}
 	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
 
 	return 0;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f5c075aff060..eba16989ce36 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4428,12 +4428,14 @@ static int ext4_handle_clustersize(struct super_block *sb)
 /*
  * ext4_atomic_write_init: Initializes filesystem min & max atomic write units.
  * @sb: super block
- * TODO: Later add support for bigalloc
  */
 static void ext4_atomic_write_init(struct super_block *sb)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct block_device *bdev = sb->s_bdev;
+	unsigned int blkbits = sb->s_blocksize_bits;
+	unsigned int clustersize = sb->s_blocksize;
+
 
 	if (!bdev_can_atomic_write(bdev))
 		return;
@@ -4441,9 +4443,12 @@ static void ext4_atomic_write_init(struct super_block *sb)
 	if (!ext4_has_feature_extents(sb))
 		return;
 
+	if (ext4_has_feature_bigalloc(sb))
+		clustersize = 1U << (sbi->s_cluster_bits + blkbits);
+
 	sbi->fs_awu_min = max(sb->s_blocksize,
 			      bdev_atomic_write_unit_min_bytes(bdev));
-	sbi->fs_awu_max = min(sb->s_blocksize,
+	sbi->fs_awu_max = min(clustersize,
 			      bdev_atomic_write_unit_max_bytes(bdev));
 	if (sbi->fs_awu_min && sbi->fs_awu_max &&
 			sbi->fs_awu_min <= sbi->fs_awu_max) {
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes
  2024-10-25  3:45 ` [PATCH 5/6] iomap: Lift blocksize restriction on " Ritesh Harjani (IBM)
@ 2024-10-25  8:52   ` John Garry
  2024-10-25  9:31     ` Ritesh Harjani
  0 siblings, 1 reply; 39+ messages in thread
From: John Garry @ 2024-10-25  8:52 UTC (permalink / raw)
  To: Ritesh Harjani (IBM), linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel

On 25/10/2024 04:45, Ritesh Harjani (IBM) wrote:
> Filesystems like ext4 can submit writes in multiples of blocksizes.
> But we still can't allow the writes to be split. Hence let's check if
> the iomap_length() is same as iter->len or not.
> 
> This shouldn't affect XFS since it anyways checks for this in
> xfs_file_write_iter() to not support atomic write size request of more
> than FS blocksize.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>   fs/iomap/direct-io.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index ed4764e3b8f0..1d33b4239b3e 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -306,7 +306,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>   	size_t copied = 0;
>   	size_t orig_count;
>   
> -	if (atomic && length != fs_block_size)
> +	if (atomic && length != iter->len)
>   		return -EINVAL;

Here you expect just one iter for an atomic write always.

In 6/6, you are saying that iomap does not allow an atomic write which 
covers unwritten and written extents, right?

But for writing a single fs block atomically, we don't mandate it to be 
in unwritten state. So there is a difference in behavior in writing a 
single FS block vs multiple FS blocks atomically.

So we have 3x choices, as I see:
a. add a check now in iomap that the extent is in written state (for an 
atomic write)
b. add extent zeroing code, as I was trying for originally
c. document this peculiarity

Thanks,
John


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes
  2024-10-25  8:52   ` John Garry
@ 2024-10-25  9:31     ` Ritesh Harjani
  2024-10-25  9:59       ` John Garry
  0 siblings, 1 reply; 39+ messages in thread
From: Ritesh Harjani @ 2024-10-25  9:31 UTC (permalink / raw)
  To: John Garry, linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel


Hi John, 

John Garry <john.g.garry@oracle.com> writes:

> On 25/10/2024 04:45, Ritesh Harjani (IBM) wrote:
>> Filesystems like ext4 can submit writes in multiples of blocksizes.
>> But we still can't allow the writes to be split. Hence let's check if
>> the iomap_length() is same as iter->len or not.
>> 
>> This shouldn't affect XFS since it anyways checks for this in
>> xfs_file_write_iter() to not support atomic write size request of more
>> than FS blocksize.
>> 
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>   fs/iomap/direct-io.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index ed4764e3b8f0..1d33b4239b3e 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -306,7 +306,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   	size_t copied = 0;
>>   	size_t orig_count;
>>   
>> -	if (atomic && length != fs_block_size)
>> +	if (atomic && length != iter->len)
>>   		return -EINVAL;
>
> Here you expect just one iter for an atomic write always.

Here we are lifting the limitation of iomap to support entire iter->len
rather than just 1 fsblock. 

>
> In 6/6, you are saying that iomap does not allow an atomic write which 
> covers unwritten and written extents, right?

No, it's not that. If FS does not provide a full mapping to iomap in
->iomap_begin then the writes will get split. For atomic writes this
should not happen, hence the check in iomap above to return -EINVAL if
mapped length does not match iter->len.

>
> But for writing a single fs block atomically, we don't mandate it to be 
> in unwritten state. So there is a difference in behavior in writing a 
> single FS block vs multiple FS blocks atomically.

Same as mentioned above. We can't have atomic writes to get split.
This patch is just lifting the restriction of iomap to allow more than
blocksize but the mapped length should still meet iter->len, as
otherwise the writes can get split.

>
> So we have 3x choices, as I see:
> a. add a check now in iomap that the extent is in written state (for an 
> atomic write)
> b. add extent zeroing code, as I was trying for originally
> c. document this peculiarity
>
> Thanks,
> John

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/6] ext4: Add statx support for atomic writes
  2024-10-25  3:45 ` [PATCH 1/6] ext4: Add statx support for atomic writes Ritesh Harjani (IBM)
@ 2024-10-25  9:41   ` John Garry
  2024-10-25 10:08     ` Ritesh Harjani
  0 siblings, 1 reply; 39+ messages in thread
From: John Garry @ 2024-10-25  9:41 UTC (permalink / raw)
  To: Ritesh Harjani (IBM), linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel

On 25/10/2024 04:45, Ritesh Harjani (IBM) wrote:
> This patch adds base support for atomic writes via statx getattr.
> On bs < ps systems, we can create FS with say bs of 16k. That means
> both atomic write min and max unit can be set to 16k for supporting
> atomic writes.
> 
> Later patches adds support for bigalloc as well so that ext4 can also
> support doing atomic writes for bs = ps systems.
> 
> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>   fs/ext4/ext4.h  |  7 ++++++-
>   fs/ext4/inode.c | 14 ++++++++++++++
>   fs/ext4/super.c | 32 ++++++++++++++++++++++++++++++++
>   3 files changed, 52 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 44b0d418143c..a41e56c2c628 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1729,6 +1729,10 @@ struct ext4_sb_info {
>   	 */
>   	struct work_struct s_sb_upd_work;
>   
> +	/* Atomic write unit values */
> +	unsigned int fs_awu_min;
> +	unsigned int fs_awu_max;
> +
>   	/* Ext4 fast commit sub transaction ID */
>   	atomic_t s_fc_subtid;
>   
> @@ -1820,7 +1824,8 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
>    */
>   enum {
>   	EXT4_MF_MNTDIR_SAMPLED,
> -	EXT4_MF_FC_INELIGIBLE	/* Fast commit ineligible */
> +	EXT4_MF_FC_INELIGIBLE,	/* Fast commit ineligible */
> +	EXT4_MF_ATOMIC_WRITE	/* Supports atomic write */

Does this flag really buy us much?

>   };
>   
>   static inline void ext4_set_mount_flag(struct super_block *sb, int bit)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 54bdd4884fe6..897c028d5bc9 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5578,6 +5578,20 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path,
>   		}
>   	}
>   
> +	if (S_ISREG(inode->i_mode) && (request_mask & STATX_WRITE_ATOMIC)) {
> +		struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> +		unsigned int awu_min, awu_max;
> +
> +		if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_ATOMIC_WRITE)) {

I'd use ext4_inode_can_atomicwrite() here, similar to what is done for xfs

> +			awu_min = sbi->fs_awu_min;
> +			awu_max = sbi->fs_awu_max;
> +		} else {
> +			awu_min = awu_max = 0;
> +		}
> +
> +		generic_fill_statx_atomic_writes(stat, awu_min, awu_max);
> +	}
> +
>   	flags = ei->i_flags & EXT4_FL_USER_VISIBLE;
>   	if (flags & EXT4_APPEND_FL)
>   		stat->attributes |= STATX_ATTR_APPEND;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 16a4ce704460..f5c075aff060 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4425,6 +4425,37 @@ static int ext4_handle_clustersize(struct super_block *sb)
>   	return 0;
>   }
>   
> +/*


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/6] ext4: Check for atomic writes support in write iter
  2024-10-25  3:45 ` [PATCH 2/6] ext4: Check for atomic writes support in write iter Ritesh Harjani (IBM)
@ 2024-10-25  9:44   ` John Garry
  2024-10-25 10:33     ` Ritesh Harjani
  0 siblings, 1 reply; 39+ messages in thread
From: John Garry @ 2024-10-25  9:44 UTC (permalink / raw)
  To: Ritesh Harjani (IBM), linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel

On 25/10/2024 04:45, Ritesh Harjani (IBM) wrote:
> Let's validate using generic_atomic_write_valid() in
> ext4_file_write_iter() if the write request has IOCB_ATOMIC set.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>   fs/ext4/file.c | 14 ++++++++++++++
>   1 file changed, 14 insertions(+)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index f14aed14b9cf..b06c5d34bbd2 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -692,6 +692,20 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>   	if (IS_DAX(inode))
>   		return ext4_dax_write_iter(iocb, from);
>   #endif
> +
> +	if (iocb->ki_flags & IOCB_ATOMIC) {
> +		size_t len = iov_iter_count(from);
> +		int ret;
> +
> +		if (!IS_ALIGNED(len, EXT4_SB(inode->i_sb)->fs_awu_min) ||
> +			len > EXT4_SB(inode->i_sb)->fs_awu_max)
> +			return -EINVAL;

this looks ok, but the IS_ALIGNED() check looks odd. I am not sure why 
you don't just check that fs_awu_max >= len >= fs_awu_min

> +
> +		ret = generic_atomic_write_valid(iocb, from);
> +		if (ret)
> +			return ret;
> +	}
> +
>   	if (iocb->ki_flags & IOCB_DIRECT)
>   		return ext4_dio_write_iter(iocb, from);
>   	else


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes
  2024-10-25  9:31     ` Ritesh Harjani
@ 2024-10-25  9:59       ` John Garry
  2024-10-25 10:35         ` Ritesh Harjani
  0 siblings, 1 reply; 39+ messages in thread
From: John Garry @ 2024-10-25  9:59 UTC (permalink / raw)
  To: Ritesh Harjani (IBM), linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel

On 25/10/2024 10:31, Ritesh Harjani (IBM) wrote:
>>>    
>>> -	if (atomic && length != fs_block_size)
>>> +	if (atomic && length != iter->len)
>>>    		return -EINVAL;
>> Here you expect just one iter for an atomic write always.
> Here we are lifting the limitation of iomap to support entire iter->len
> rather than just 1 fsblock.

Sure

> 
>> In 6/6, you are saying that iomap does not allow an atomic write which
>> covers unwritten and written extents, right?
> No, it's not that. If FS does not provide a full mapping to iomap in
> ->iomap_begin then the writes will get split. 

but why would it provide multiple mapping?

> For atomic writes this
> should not happen, hence the check in iomap above to return -EINVAL if
> mapped length does not match iter->len.
> 
>> But for writing a single fs block atomically, we don't mandate it to be
>> in unwritten state. So there is a difference in behavior in writing a
>> single FS block vs multiple FS blocks atomically.
> Same as mentioned above. We can't have atomic writes to get split.
> This patch is just lifting the restriction of iomap to allow more than
> blocksize but the mapped length should still meet iter->len, as
> otherwise the writes can get split.

Sure, I get this. But I wonder why would we be getting multiple 
mappings? Why cannot the FS always provide a single mapping?

Thanks,
John


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/6] ext4: Add statx support for atomic writes
  2024-10-25  9:41   ` John Garry
@ 2024-10-25 10:08     ` Ritesh Harjani
  2024-10-25 16:09       ` Darrick J. Wong
  0 siblings, 1 reply; 39+ messages in thread
From: Ritesh Harjani @ 2024-10-25 10:08 UTC (permalink / raw)
  To: John Garry, linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel

John Garry <john.g.garry@oracle.com> writes:

> On 25/10/2024 04:45, Ritesh Harjani (IBM) wrote:
>> This patch adds base support for atomic writes via statx getattr.
>> On bs < ps systems, we can create FS with say bs of 16k. That means
>> both atomic write min and max unit can be set to 16k for supporting
>> atomic writes.
>> 
>> Later patches adds support for bigalloc as well so that ext4 can also
>> support doing atomic writes for bs = ps systems.
>> 
>> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>   fs/ext4/ext4.h  |  7 ++++++-
>>   fs/ext4/inode.c | 14 ++++++++++++++
>>   fs/ext4/super.c | 32 ++++++++++++++++++++++++++++++++
>>   3 files changed, 52 insertions(+), 1 deletion(-)
>> 
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 44b0d418143c..a41e56c2c628 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1729,6 +1729,10 @@ struct ext4_sb_info {
>>   	 */
>>   	struct work_struct s_sb_upd_work;
>>   
>> +	/* Atomic write unit values */
>> +	unsigned int fs_awu_min;
>> +	unsigned int fs_awu_max;
>> +
>>   	/* Ext4 fast commit sub transaction ID */
>>   	atomic_t s_fc_subtid;
>>   
>> @@ -1820,7 +1824,8 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
>>    */
>>   enum {
>>   	EXT4_MF_MNTDIR_SAMPLED,
>> -	EXT4_MF_FC_INELIGIBLE	/* Fast commit ineligible */
>> +	EXT4_MF_FC_INELIGIBLE,	/* Fast commit ineligible */
>> +	EXT4_MF_ATOMIC_WRITE	/* Supports atomic write */
>
> Does this flag really buy us much?
>

I felt it is cleaner this way than comparing non-zero values of
fs_awu_min and fs_awu_max.


Now that you pointed at it - Maybe a question for others who might have
the history of which one to use when - or do we think there is a scope
of merging the two into just one as a later cleanup?

I know that s_mount_flags was added for fastcommit and it needed the
state manipulations to be done in atomic way. Similarly s_ext4_flags
also was renamed from s_resize_flags for more general purpose use. Both
of these looks like could be merged isn't it?



>>   };
>>   
>>   static inline void ext4_set_mount_flag(struct super_block *sb, int bit)
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 54bdd4884fe6..897c028d5bc9 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -5578,6 +5578,20 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path,
>>   		}
>>   	}
>>   
>> +	if (S_ISREG(inode->i_mode) && (request_mask & STATX_WRITE_ATOMIC)) {
>> +		struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>> +		unsigned int awu_min, awu_max;
>> +
>> +		if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_ATOMIC_WRITE)) {
>
> I'd use ext4_inode_can_atomicwrite() here, similar to what is done for xfs
>

Sure since it is inode operation, we can check against ext4_inode_can_atomicwrite().


>> +			awu_min = sbi->fs_awu_min;
>> +			awu_max = sbi->fs_awu_max;
>> +		} else {
>> +			awu_min = awu_max = 0;
>> +		}
>> +
>> +		generic_fill_statx_atomic_writes(stat, awu_min, awu_max);
>> +	}
>> +
>>   	flags = ei->i_flags & EXT4_FL_USER_VISIBLE;
>>   	if (flags & EXT4_APPEND_FL)
>>   		stat->attributes |= STATX_ATTR_APPEND;
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index 16a4ce704460..f5c075aff060 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -4425,6 +4425,37 @@ static int ext4_handle_clustersize(struct super_block *sb)
>>   	return 0;
>>   }
>>   
>> +/*

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/6] ext4: Check for atomic writes support in write iter
  2024-10-25  9:44   ` John Garry
@ 2024-10-25 10:33     ` Ritesh Harjani
  2024-10-25 16:11       ` Darrick J. Wong
  0 siblings, 1 reply; 39+ messages in thread
From: Ritesh Harjani @ 2024-10-25 10:33 UTC (permalink / raw)
  To: John Garry, linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel

John Garry <john.g.garry@oracle.com> writes:

> On 25/10/2024 04:45, Ritesh Harjani (IBM) wrote:
>> Let's validate using generic_atomic_write_valid() in
>> ext4_file_write_iter() if the write request has IOCB_ATOMIC set.
>> 
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>   fs/ext4/file.c | 14 ++++++++++++++
>>   1 file changed, 14 insertions(+)
>> 
>> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> index f14aed14b9cf..b06c5d34bbd2 100644
>> --- a/fs/ext4/file.c
>> +++ b/fs/ext4/file.c
>> @@ -692,6 +692,20 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>   	if (IS_DAX(inode))
>>   		return ext4_dax_write_iter(iocb, from);
>>   #endif
>> +
>> +	if (iocb->ki_flags & IOCB_ATOMIC) {
>> +		size_t len = iov_iter_count(from);
>> +		int ret;
>> +
>> +		if (!IS_ALIGNED(len, EXT4_SB(inode->i_sb)->fs_awu_min) ||
>> +			len > EXT4_SB(inode->i_sb)->fs_awu_max)
>> +			return -EINVAL;
>
> this looks ok, but the IS_ALIGNED() check looks odd. I am not sure why 
> you don't just check that fs_awu_max >= len >= fs_awu_min
>

I guess this was just a stricter check. But we anyways have power_of_2
and other checks in generic_atomic_write_valid(). So it does not matter. 

I can change this in v2. 

Thanks!

>> +
>> +		ret = generic_atomic_write_valid(iocb, from);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>>   	if (iocb->ki_flags & IOCB_DIRECT)
>>   		return ext4_dio_write_iter(iocb, from);
>>   	else

-ritesh

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes
  2024-10-25  9:59       ` John Garry
@ 2024-10-25 10:35         ` Ritesh Harjani
  2024-10-25 11:07           ` John Garry
  0 siblings, 1 reply; 39+ messages in thread
From: Ritesh Harjani @ 2024-10-25 10:35 UTC (permalink / raw)
  To: John Garry, linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel

John Garry <john.g.garry@oracle.com> writes:

> On 25/10/2024 10:31, Ritesh Harjani (IBM) wrote:
>>>>    
>>>> -	if (atomic && length != fs_block_size)
>>>> +	if (atomic && length != iter->len)
>>>>    		return -EINVAL;
>>> Here you expect just one iter for an atomic write always.
>> Here we are lifting the limitation of iomap to support entire iter->len
>> rather than just 1 fsblock.
>
> Sure
>
>> 
>>> In 6/6, you are saying that iomap does not allow an atomic write which
>>> covers unwritten and written extents, right?
>> No, it's not that. If FS does not provide a full mapping to iomap in
>> ->iomap_begin then the writes will get split. 
>
> but why would it provide multiple mapping?
>
>> For atomic writes this
>> should not happen, hence the check in iomap above to return -EINVAL if
>> mapped length does not match iter->len.
>> 
>>> But for writing a single fs block atomically, we don't mandate it to be
>>> in unwritten state. So there is a difference in behavior in writing a
>>> single FS block vs multiple FS blocks atomically.
>> Same as mentioned above. We can't have atomic writes to get split.
>> This patch is just lifting the restriction of iomap to allow more than
>> blocksize but the mapped length should still meet iter->len, as
>> otherwise the writes can get split.
>
> Sure, I get this. But I wonder why would we be getting multiple 
> mappings? Why cannot the FS always provide a single mapping?

FS can decide to split the mappings when it couldn't allocate a single
large mapping of the requested length. Could be due to - 
- already allocated extent followed by EOF, 
- already allocated extent followed by a hole
- already mapped extent followed by an extent of different type (e.g. written followed by unwritten or unwritten followed by written)
- delalloc (not delalloc since we invalidate respective page cache pages before doing DIO).
- fragmentation or ENOSPC - For ext4 bigalloc this will not happen since
we reserve the entire cluster. So we know there should be space. But I
am not sure how other filesystems might end up implementing this functionality.

Thanks!

-ritesh

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes
  2024-10-25 10:35         ` Ritesh Harjani
@ 2024-10-25 11:07           ` John Garry
  2024-10-25 11:19             ` Ritesh Harjani
  0 siblings, 1 reply; 39+ messages in thread
From: John Garry @ 2024-10-25 11:07 UTC (permalink / raw)
  To: Ritesh Harjani (IBM), linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel

On 25/10/2024 11:35, Ritesh Harjani (IBM) wrote:
>>> Same as mentioned above. We can't have atomic writes to get split.
>>> This patch is just lifting the restriction of iomap to allow more than
>>> blocksize but the mapped length should still meet iter->len, as
>>> otherwise the writes can get split.
>> Sure, I get this. But I wonder why would we be getting multiple
>> mappings? Why cannot the FS always provide a single mapping?
> FS can decide to split the mappings when it couldn't allocate a single
> large mapping of the requested length. Could be due to -
> - already allocated extent followed by EOF,
> - already allocated extent followed by a hole
> - already mapped extent followed by an extent of different type (e.g. written followed by unwritten or unwritten followed by written)

This is the sort of scenario which I am concerned with. This issue has 
been discussed at length for XFS forcealign support for atomic writes.

So far, the user can atomic write a single FS block regardless of 
whether the extent in which it would be part of is in written or 
unwritten state.

Now the rule will be to write multiple FS blocks atomically, all blocks 
need to be in same written or unwritten state.

This oddity at least needs to be documented.

Better yet would be to not have this restriction.

> - delalloc (not delalloc since we invalidate respective page cache pages before doing DIO).
> - fragmentation or ENOSPC - For ext4 bigalloc this will not happen since
> we reserve the entire cluster. So we know there should be space. But I
> am not sure how other filesystems might end up implementing this functionality.

Thanks,
John


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes
  2024-10-25 11:07           ` John Garry
@ 2024-10-25 11:19             ` Ritesh Harjani
  2024-10-25 12:23               ` John Garry
  0 siblings, 1 reply; 39+ messages in thread
From: Ritesh Harjani @ 2024-10-25 11:19 UTC (permalink / raw)
  To: John Garry, linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel

John Garry <john.g.garry@oracle.com> writes:

> On 25/10/2024 11:35, Ritesh Harjani (IBM) wrote:
>>>> Same as mentioned above. We can't have atomic writes to get split.
>>>> This patch is just lifting the restriction of iomap to allow more than
>>>> blocksize but the mapped length should still meet iter->len, as
>>>> otherwise the writes can get split.
>>> Sure, I get this. But I wonder why would we be getting multiple
>>> mappings? Why cannot the FS always provide a single mapping?
>> FS can decide to split the mappings when it couldn't allocate a single
>> large mapping of the requested length. Could be due to -
>> - already allocated extent followed by EOF,
>> - already allocated extent followed by a hole
>> - already mapped extent followed by an extent of different type (e.g. written followed by unwritten or unwritten followed by written)
>
> This is the sort of scenario which I am concerned with. This issue has 
> been discussed at length for XFS forcealign support for atomic writes.

extsize and forcealign is being worked for ext4 as well where we can
add such support, sure.

>
> So far, the user can atomic write a single FS block regardless of 
> whether the extent in which it would be part of is in written or 
> unwritten state.
>
> Now the rule will be to write multiple FS blocks atomically, all blocks 
> need to be in same written or unwritten state.

FS needs to ensure that the writes does not get torned. So for whatever reason
FS splits the mapping then we need to return an -EINVAL error to not
allow such writes to get torned. This patch just does that.

But I get your point. More below.

>
> This oddity at least needs to be documented.

Got it. Yes, we can do that.

>
> Better yet would be to not have this restriction.
>

I haven't thought of a clever way where we don't have to zero out the
rest of the unwritten mapping. With ext4 bigalloc since the entire
cluster is anyway reserved - I was thinking if we can come up with a
clever way for doing atomic writes to the entire user requested size w/o
zeroing out.

Zeroing out the other unwritten extent is also a cost penalty to the
user anyways. So user will anyway will have to be made aware of not to
attempt writes of fashion which can cause them such penalties. 

As patch-6 mentions this is a base support for bs = ps systems for
enabling atomic writes using bigalloc. For now we return -EINVAL when we
can't allocate a continuous user requested mapping which means it won't
support operations of types 8k followed by 16k.

We can document this behavior as other things are documented for atomic
writes on ext4 with bigalloc e.g. pow_of_2 length writes, natural
alignment w.r.t pos and length etc.

Does that sound ok?

>> - delalloc (not delalloc since we invalidate respective page cache pages before doing DIO).
>> - fragmentation or ENOSPC - For ext4 bigalloc this will not happen since
>> we reserve the entire cluster. So we know there should be space. But I
>> am not sure how other filesystems might end up implementing this functionality.
>
> Thanks,
> John

-ritesh

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes
  2024-10-25 11:19             ` Ritesh Harjani
@ 2024-10-25 12:23               ` John Garry
  2024-10-25 12:36                 ` Ritesh Harjani
  0 siblings, 1 reply; 39+ messages in thread
From: John Garry @ 2024-10-25 12:23 UTC (permalink / raw)
  To: Ritesh Harjani (IBM), linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel

On 25/10/2024 12:19, Ritesh Harjani (IBM) wrote:
> John Garry <john.g.garry@oracle.com> writes:
> 
>> On 25/10/2024 11:35, Ritesh Harjani (IBM) wrote:
>>>>> Same as mentioned above. We can't have atomic writes to get split.
>>>>> This patch is just lifting the restriction of iomap to allow more than
>>>>> blocksize but the mapped length should still meet iter->len, as
>>>>> otherwise the writes can get split.
>>>> Sure, I get this. But I wonder why would we be getting multiple
>>>> mappings? Why cannot the FS always provide a single mapping?
>>> FS can decide to split the mappings when it couldn't allocate a single
>>> large mapping of the requested length. Could be due to -
>>> - already allocated extent followed by EOF,
>>> - already allocated extent followed by a hole
>>> - already mapped extent followed by an extent of different type (e.g. written followed by unwritten or unwritten followed by written)
>>
>> This is the sort of scenario which I am concerned with. This issue has
>> been discussed at length for XFS forcealign support for atomic writes.
> 
> extsize and forcealign is being worked for ext4 as well where we can
> add such support, sure.
> 
>>
>> So far, the user can atomic write a single FS block regardless of
>> whether the extent in which it would be part of is in written or
>> unwritten state.
>>
>> Now the rule will be to write multiple FS blocks atomically, all blocks
>> need to be in same written or unwritten state.
> 
> FS needs to ensure that the writes does not get torned. So for whatever reason
> FS splits the mapping then we need to return an -EINVAL error to not
> allow such writes to get torned. This patch just does that.
> 
> But I get your point. More below.
> 
>>
>> This oddity at least needs to be documented.
> 
> Got it. Yes, we can do that.
> 
>>
>> Better yet would be to not have this restriction.
>>
> 
> I haven't thought of a clever way where we don't have to zero out the
> rest of the unwritten mapping. With ext4 bigalloc since the entire
> cluster is anyway reserved - I was thinking if we can come up with a
> clever way for doing atomic writes to the entire user requested size w/o
> zeroing out.

This following was main method which was being attempted:

https://lore.kernel.org/linux-fsdevel/20240429174746.2132161-15-john.g.garry@oracle.com/

There were other ideas in different versions of the forcelign/xfs block 
atomic writes series.

> 
> Zeroing out the other unwritten extent is also a cost penalty to the
> user anyways.

Sure, unless we have a special inode flag to say "pre-zero the extent".

> So user will anyway will have to be made aware of not to
> attempt writes of fashion which can cause them such penalties.
> 
> As patch-6 mentions this is a base support for bs = ps systems for
> enabling atomic writes using bigalloc. For now we return -EINVAL when we
> can't allocate a continuous user requested mapping which means it won't
> support operations of types 8k followed by 16k.
> 

That's my least-preferred option.

I think better would be reject atomic writes that cover unwritten 
extents always - but that boat is about to sail...

Thanks,
John


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes
  2024-10-25 12:23               ` John Garry
@ 2024-10-25 12:36                 ` Ritesh Harjani
  2024-10-25 14:04                   ` John Garry
  0 siblings, 1 reply; 39+ messages in thread
From: Ritesh Harjani @ 2024-10-25 12:36 UTC (permalink / raw)
  To: John Garry, linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel

John Garry <john.g.garry@oracle.com> writes:

> On 25/10/2024 12:19, Ritesh Harjani (IBM) wrote:
>> John Garry <john.g.garry@oracle.com> writes:
>> 
>>> On 25/10/2024 11:35, Ritesh Harjani (IBM) wrote:
>>>>>> Same as mentioned above. We can't have atomic writes to get split.
>>>>>> This patch is just lifting the restriction of iomap to allow more than
>>>>>> blocksize but the mapped length should still meet iter->len, as
>>>>>> otherwise the writes can get split.
>>>>> Sure, I get this. But I wonder why would we be getting multiple
>>>>> mappings? Why cannot the FS always provide a single mapping?
>>>> FS can decide to split the mappings when it couldn't allocate a single
>>>> large mapping of the requested length. Could be due to -
>>>> - already allocated extent followed by EOF,
>>>> - already allocated extent followed by a hole
>>>> - already mapped extent followed by an extent of different type (e.g. written followed by unwritten or unwritten followed by written)
>>>
>>> This is the sort of scenario which I am concerned with. This issue has
>>> been discussed at length for XFS forcealign support for atomic writes.
>> 
>> extsize and forcealign is being worked for ext4 as well where we can
>> add such support, sure.
>> 
>>>
>>> So far, the user can atomic write a single FS block regardless of
>>> whether the extent in which it would be part of is in written or
>>> unwritten state.
>>>
>>> Now the rule will be to write multiple FS blocks atomically, all blocks
>>> need to be in same written or unwritten state.
>> 
>> FS needs to ensure that the writes does not get torned. So for whatever reason
>> FS splits the mapping then we need to return an -EINVAL error to not
>> allow such writes to get torned. This patch just does that.
>> 
>> But I get your point. More below.
>> 
>>>
>>> This oddity at least needs to be documented.
>> 
>> Got it. Yes, we can do that.
>> 
>>>
>>> Better yet would be to not have this restriction.
>>>
>> 
>> I haven't thought of a clever way where we don't have to zero out the
>> rest of the unwritten mapping. With ext4 bigalloc since the entire
>> cluster is anyway reserved - I was thinking if we can come up with a
>> clever way for doing atomic writes to the entire user requested size w/o
>> zeroing out.
>
> This following was main method which was being attempted:
>
> https://lore.kernel.org/linux-fsdevel/20240429174746.2132161-15-john.g.garry@oracle.com/
>
> There were other ideas in different versions of the forcelign/xfs block 
> atomic writes series.
>
>> 
>> Zeroing out the other unwritten extent is also a cost penalty to the
>> user anyways.
>
> Sure, unless we have a special inode flag to say "pre-zero the extent".
>
>> So user will anyway will have to be made aware of not to
>> attempt writes of fashion which can cause them such penalties.
>> 
>> As patch-6 mentions this is a base support for bs = ps systems for
>> enabling atomic writes using bigalloc. For now we return -EINVAL when we
>> can't allocate a continuous user requested mapping which means it won't
>> support operations of types 8k followed by 16k.
>> 
>
> That's my least-preferred option.
>
> I think better would be reject atomic writes that cover unwritten 
> extents always - but that boat is about to sail...

That's what this patch does. For whatever reason if we couldn't allocate
a single contiguous region of requested size for atomic write, then we
reject the request always, isn't it. Or maybe I didn't understand your comment.

If others prefer - we can maybe add such a check (e.g. ext4_dio_atomic_write_checks()) 
for atomic writes in ext4_dio_write_checks(), similar to how we detect
overwrites case to decide whether we need a read v/s write semaphore. 
So this can check if the user has a partially allocated extent for the
user requested region and if yes, we can return -EINVAL from
ext4_dio_write_iter() itself. 

I think this maybe better option than waiting until ->iomap_begin().
This might also bring all atomic write constraints to be checked in one
place i.e. during ext4_file_write_iter() itself.

Thoughts?

-ritesh

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes
  2024-10-25 12:36                 ` Ritesh Harjani
@ 2024-10-25 14:04                   ` John Garry
  2024-10-25 14:13                     ` Ritesh Harjani
  0 siblings, 1 reply; 39+ messages in thread
From: John Garry @ 2024-10-25 14:04 UTC (permalink / raw)
  To: Ritesh Harjani (IBM), linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel

On 25/10/2024 13:36, Ritesh Harjani (IBM) wrote:
>>> So user will anyway will have to be made aware of not to
>>> attempt writes of fashion which can cause them such penalties.
>>>
>>> As patch-6 mentions this is a base support for bs = ps systems for
>>> enabling atomic writes using bigalloc. For now we return -EINVAL when we
>>> can't allocate a continuous user requested mapping which means it won't
>>> support operations of types 8k followed by 16k.
>>>
>> That's my least-preferred option.
>>
>> I think better would be reject atomic writes that cover unwritten
>> extents always - but that boat is about to sail...
> That's what this patch does.

Not really.

Currently we have 2x iomap restrictions:
a. mapping length must equal fs block size
b. bio created must equal total write size

This patch just says that the mapping length must equal total write size 
(instead of a.). So quite similar to b.

> For whatever reason if we couldn't allocate
> a single contiguous region of requested size for atomic write, then we
> reject the request always, isn't it. Or maybe I didn't understand your comment.

As the simplest example, for an atomic write to an empty file, there 
should only be a single mapping returned to iomap_dio_bio_iter() and 
that would be of IOMAP_UNWRITTEN type. And we don't reject that.

> 
> If others prefer - we can maybe add such a check (e.g. ext4_dio_atomic_write_checks())
> for atomic writes in ext4_dio_write_checks(), similar to how we detect
> overwrites case to decide whether we need a read v/s write semaphore.
> So this can check if the user has a partially allocated extent for the
> user requested region and if yes, we can return -EINVAL from
> ext4_dio_write_iter() itself.
 > > I think this maybe better option than waiting until ->iomap_begin().
> This might also bring all atomic write constraints to be checked in one
> place i.e. during ext4_file_write_iter() itself.

Something like this can be done once we decide how atomic writing to 
regions which cover mixed unwritten and written extents is to be handled.

Thanks,
John


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes
  2024-10-25 14:04                   ` John Garry
@ 2024-10-25 14:13                     ` Ritesh Harjani
  2024-10-25 18:28                       ` Darrick J. Wong
  0 siblings, 1 reply; 39+ messages in thread
From: Ritesh Harjani @ 2024-10-25 14:13 UTC (permalink / raw)
  To: John Garry, linux-ext4
  Cc: Theodore Ts'o, Jan Kara, Darrick J . Wong, Christoph Hellwig,
	Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel

John Garry <john.g.garry@oracle.com> writes:

> On 25/10/2024 13:36, Ritesh Harjani (IBM) wrote:
>>>> So user will anyway will have to be made aware of not to
>>>> attempt writes of fashion which can cause them such penalties.
>>>>
>>>> As patch-6 mentions this is a base support for bs = ps systems for
>>>> enabling atomic writes using bigalloc. For now we return -EINVAL when we
>>>> can't allocate a continuous user requested mapping which means it won't
>>>> support operations of types 8k followed by 16k.
>>>>
>>> That's my least-preferred option.
>>>
>>> I think better would be reject atomic writes that cover unwritten
>>> extents always - but that boat is about to sail...
>> That's what this patch does.
>
> Not really.
>
> Currently we have 2x iomap restrictions:
> a. mapping length must equal fs block size
> b. bio created must equal total write size
>
> This patch just says that the mapping length must equal total write size 
> (instead of a.). So quite similar to b.
>
>> For whatever reason if we couldn't allocate
>> a single contiguous region of requested size for atomic write, then we
>> reject the request always, isn't it. Or maybe I didn't understand your comment.
>
> As the simplest example, for an atomic write to an empty file, there 
> should only be a single mapping returned to iomap_dio_bio_iter() and 
> that would be of IOMAP_UNWRITTEN type. And we don't reject that.
>

Ok. Maybe this is what I am missing. Could you please help me understand
why should such writes be rejected? 

For e.g. 
If FS could allocate a single contiguous IOMAP_UNWRITTEN extent of
atomic write request size, that means - 
1. FS will allocate an unwritten extent.
2. will do writes (using submit_bio) to the unwritten extent. 
3. will do unwritten to written conversion. 

It is ok if either of the above operations fail right? If (3) fails
then the region will still be marked unwritten that means it will read
zero (old contents). (2) can anyway fail and will not result into
partial writes. (1) will anyway not result into any write whatsoever.

So we can never have a situation where there is partial writes leading
to mix of old and new write contents right for such cases? Which is what the
requirement of atomic/untorn write also is?

Sorry am I missing something here?

>> 
>> If others prefer - we can maybe add such a check (e.g. ext4_dio_atomic_write_checks())
>> for atomic writes in ext4_dio_write_checks(), similar to how we detect
>> overwrites case to decide whether we need a read v/s write semaphore.
>> So this can check if the user has a partially allocated extent for the
>> user requested region and if yes, we can return -EINVAL from
>> ext4_dio_write_iter() itself.
>  > > I think this maybe better option than waiting until ->iomap_begin().
>> This might also bring all atomic write constraints to be checked in one
>> place i.e. during ext4_file_write_iter() itself.
>
> Something like this can be done once we decide how atomic writing to 
> regions which cover mixed unwritten and written extents is to be handled.

Mixed extent regions (written + unwritten) is a different case all
together (which can lead to mix of old and new contents).


But here what I am suggesting is to add following constraint in case of
ext4 with bigalloc - 

"Writes to a region which already has partially allocated extent is not supported."

That means we will return -EINVAL if we detect above case in
ext4_file_write_iter() and sure we can document this behavior.

In retrospect, I am not sure why we cannot add a constraint for atomic
writes (e.g. for ext4 bigalloc) and reject such writes outright,
instead of silently incurring a performance penalty by zeroing out the
partial regions by allowing such write request.

-ritesh

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/6] ext4: Add statx support for atomic writes
  2024-10-25 10:08     ` Ritesh Harjani
@ 2024-10-25 16:09       ` Darrick J. Wong
  2024-10-25 17:45         ` Ritesh Harjani
  0 siblings, 1 reply; 39+ messages in thread
From: Darrick J. Wong @ 2024-10-25 16:09 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: John Garry, linux-ext4, Theodore Ts'o, Jan Kara,
	Christoph Hellwig, Ojaswin Mujoo, Dave Chinner, linux-kernel,
	linux-xfs, linux-fsdevel

On Fri, Oct 25, 2024 at 03:38:03PM +0530, Ritesh Harjani wrote:
> John Garry <john.g.garry@oracle.com> writes:
> 
> > On 25/10/2024 04:45, Ritesh Harjani (IBM) wrote:
> >> This patch adds base support for atomic writes via statx getattr.
> >> On bs < ps systems, we can create FS with say bs of 16k. That means
> >> both atomic write min and max unit can be set to 16k for supporting
> >> atomic writes.
> >> 
> >> Later patches adds support for bigalloc as well so that ext4 can also
> >> support doing atomic writes for bs = ps systems.
> >> 
> >> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> >> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> >> ---
> >>   fs/ext4/ext4.h  |  7 ++++++-
> >>   fs/ext4/inode.c | 14 ++++++++++++++
> >>   fs/ext4/super.c | 32 ++++++++++++++++++++++++++++++++
> >>   3 files changed, 52 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >> index 44b0d418143c..a41e56c2c628 100644
> >> --- a/fs/ext4/ext4.h
> >> +++ b/fs/ext4/ext4.h
> >> @@ -1729,6 +1729,10 @@ struct ext4_sb_info {
> >>   	 */
> >>   	struct work_struct s_sb_upd_work;
> >>   
> >> +	/* Atomic write unit values */
> >> +	unsigned int fs_awu_min;
> >> +	unsigned int fs_awu_max;
> >> +
> >>   	/* Ext4 fast commit sub transaction ID */
> >>   	atomic_t s_fc_subtid;
> >>   
> >> @@ -1820,7 +1824,8 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
> >>    */
> >>   enum {
> >>   	EXT4_MF_MNTDIR_SAMPLED,
> >> -	EXT4_MF_FC_INELIGIBLE	/* Fast commit ineligible */
> >> +	EXT4_MF_FC_INELIGIBLE,	/* Fast commit ineligible */
> >> +	EXT4_MF_ATOMIC_WRITE	/* Supports atomic write */
> >
> > Does this flag really buy us much?
> >
> 
> I felt it is cleaner this way than comparing non-zero values of
> fs_awu_min and fs_awu_max.

What does it mean when MF_ATOMIC_WRITE is set and fs_awu_* are zero?
The awu values don't change at runtime, so I think you can save yourself
an atomic test by checking (non-atomically) for awu_min>0.

(I don't know anything about the flags, those came after my time iirc.)

--D

> Now that you pointed at it - Maybe a question for others who might have
> the history of which one to use when - or do we think there is a scope
> of merging the two into just one as a later cleanup?
> 
> I know that s_mount_flags was added for fastcommit and it needed the
> state manipulations to be done in atomic way. Similarly s_ext4_flags
> also was renamed from s_resize_flags for more general purpose use. Both
> of these looks like could be merged isn't it?
> 
> 
> 
> >>   };
> >>   
> >>   static inline void ext4_set_mount_flag(struct super_block *sb, int bit)
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index 54bdd4884fe6..897c028d5bc9 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -5578,6 +5578,20 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path,
> >>   		}
> >>   	}
> >>   
> >> +	if (S_ISREG(inode->i_mode) && (request_mask & STATX_WRITE_ATOMIC)) {
> >> +		struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> >> +		unsigned int awu_min, awu_max;
> >> +
> >> +		if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_ATOMIC_WRITE)) {
> >
> > I'd use ext4_inode_can_atomicwrite() here, similar to what is done for xfs
> >
> 
> Sure since it is inode operation, we can check against ext4_inode_can_atomicwrite().
> 
> 
> >> +			awu_min = sbi->fs_awu_min;
> >> +			awu_max = sbi->fs_awu_max;
> >> +		} else {
> >> +			awu_min = awu_max = 0;
> >> +		}
> >> +
> >> +		generic_fill_statx_atomic_writes(stat, awu_min, awu_max);
> >> +	}
> >> +
> >>   	flags = ei->i_flags & EXT4_FL_USER_VISIBLE;
> >>   	if (flags & EXT4_APPEND_FL)
> >>   		stat->attributes |= STATX_ATTR_APPEND;
> >> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> >> index 16a4ce704460..f5c075aff060 100644
> >> --- a/fs/ext4/super.c
> >> +++ b/fs/ext4/super.c
> >> @@ -4425,6 +4425,37 @@ static int ext4_handle_clustersize(struct super_block *sb)
> >>   	return 0;
> >>   }
> >>   
> >> +/*
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/6] ext4: Check for atomic writes support in write iter
  2024-10-25 10:33     ` Ritesh Harjani
@ 2024-10-25 16:11       ` Darrick J. Wong
  2024-10-25 17:50         ` Ritesh Harjani
  0 siblings, 1 reply; 39+ messages in thread
From: Darrick J. Wong @ 2024-10-25 16:11 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: John Garry, linux-ext4, Theodore Ts'o, Jan Kara,
	Christoph Hellwig, Ojaswin Mujoo, Dave Chinner, linux-kernel,
	linux-xfs, linux-fsdevel

On Fri, Oct 25, 2024 at 04:03:02PM +0530, Ritesh Harjani wrote:
> John Garry <john.g.garry@oracle.com> writes:
> 
> > On 25/10/2024 04:45, Ritesh Harjani (IBM) wrote:
> >> Let's validate using generic_atomic_write_valid() in
> >> ext4_file_write_iter() if the write request has IOCB_ATOMIC set.
> >> 
> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> >> ---
> >>   fs/ext4/file.c | 14 ++++++++++++++
> >>   1 file changed, 14 insertions(+)
> >> 
> >> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> >> index f14aed14b9cf..b06c5d34bbd2 100644
> >> --- a/fs/ext4/file.c
> >> +++ b/fs/ext4/file.c
> >> @@ -692,6 +692,20 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >>   	if (IS_DAX(inode))
> >>   		return ext4_dax_write_iter(iocb, from);
> >>   #endif
> >> +
> >> +	if (iocb->ki_flags & IOCB_ATOMIC) {
> >> +		size_t len = iov_iter_count(from);
> >> +		int ret;
> >> +
> >> +		if (!IS_ALIGNED(len, EXT4_SB(inode->i_sb)->fs_awu_min) ||
> >> +			len > EXT4_SB(inode->i_sb)->fs_awu_max)
> >> +			return -EINVAL;
> >
> > this looks ok, but the IS_ALIGNED() check looks odd. I am not sure why 
> > you don't just check that fs_awu_max >= len >= fs_awu_min
> >
> 
> I guess this was just a stricter check. But we anyways have power_of_2
> and other checks in generic_atomic_write_valid(). So it does not matter. 
> 
> I can change this in v2. 

Also please fix the weird indenting in the if test:

		if (len < EXT4_SB(inode->i_sb)->fs_awu_min) ||
		    len > EXT4_SB(inode->i_sb)->fs_awu_max)
			return -EINVAL;

--D

> Thanks!
> 
> >> +
> >> +		ret = generic_atomic_write_valid(iocb, from);
> >> +		if (ret)
> >> +			return ret;
> >> +	}
> >> +
> >>   	if (iocb->ki_flags & IOCB_DIRECT)
> >>   		return ext4_dio_write_iter(iocb, from);
> >>   	else
> 
> -ritesh
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes
  2024-10-25  3:45 ` [PATCH 4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes Ritesh Harjani (IBM)
@ 2024-10-25 16:16   ` Darrick J. Wong
  2024-10-25 17:51     ` Ritesh Harjani
  2024-10-27 22:26   ` Dave Chinner
  1 sibling, 1 reply; 39+ messages in thread
From: Darrick J. Wong @ 2024-10-25 16:16 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Christoph Hellwig,
	John Garry, Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel

On Fri, Oct 25, 2024 at 09:15:53AM +0530, Ritesh Harjani (IBM) wrote:
> iomap will not return -ENOTBLK in case of dio atomic writes. But let's
> also add a WARN_ON_ONCE and return -EIO as a safety net.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/ext4/file.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index f9516121a036..af6ebd0ac0d6 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -576,8 +576,16 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		iomap_ops = &ext4_iomap_overwrite_ops;
>  	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>  			   dio_flags, NULL, 0);
> -	if (ret == -ENOTBLK)
> +	if (ret == -ENOTBLK) {
>  		ret = 0;
> +		/*
> +		 * iomap will never return -ENOTBLK if write fails for atomic
> +		 * write. But let's just add a safety net.

I think it can if the pagecache invalidation fails, so you really do
need the safety net.  I suspect that the xfs version of this series
needs it too, though it may have fallen out?

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> +		 */
> +		if (WARN_ON_ONCE(iocb->ki_flags & IOCB_ATOMIC))
> +			ret = -EIO;
> +	}
> +
>  	if (extend) {
>  		/*
>  		 * We always perform extending DIO write synchronously so by
> -- 
> 2.46.0
> 
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/6] ext4: Add statx support for atomic writes
  2024-10-25 16:09       ` Darrick J. Wong
@ 2024-10-25 17:45         ` Ritesh Harjani
  0 siblings, 0 replies; 39+ messages in thread
From: Ritesh Harjani @ 2024-10-25 17:45 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Garry, linux-ext4, Theodore Ts'o, Jan Kara,
	Christoph Hellwig, Ojaswin Mujoo, Dave Chinner, linux-kernel,
	linux-xfs, linux-fsdevel

"Darrick J. Wong" <djwong@kernel.org> writes:

> On Fri, Oct 25, 2024 at 03:38:03PM +0530, Ritesh Harjani wrote:
>> John Garry <john.g.garry@oracle.com> writes:
>> 
>> > On 25/10/2024 04:45, Ritesh Harjani (IBM) wrote:
>> >> This patch adds base support for atomic writes via statx getattr.
>> >> On bs < ps systems, we can create FS with say bs of 16k. That means
>> >> both atomic write min and max unit can be set to 16k for supporting
>> >> atomic writes.
>> >> 
>> >> Later patches adds support for bigalloc as well so that ext4 can also
>> >> support doing atomic writes for bs = ps systems.
>> >> 
>> >> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> >> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> >> ---
>> >>   fs/ext4/ext4.h  |  7 ++++++-
>> >>   fs/ext4/inode.c | 14 ++++++++++++++
>> >>   fs/ext4/super.c | 32 ++++++++++++++++++++++++++++++++
>> >>   3 files changed, 52 insertions(+), 1 deletion(-)
>> >> 
>> >> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> >> index 44b0d418143c..a41e56c2c628 100644
>> >> --- a/fs/ext4/ext4.h
>> >> +++ b/fs/ext4/ext4.h
>> >> @@ -1729,6 +1729,10 @@ struct ext4_sb_info {
>> >>   	 */
>> >>   	struct work_struct s_sb_upd_work;
>> >>   
>> >> +	/* Atomic write unit values */
>> >> +	unsigned int fs_awu_min;
>> >> +	unsigned int fs_awu_max;
>> >> +
>> >>   	/* Ext4 fast commit sub transaction ID */
>> >>   	atomic_t s_fc_subtid;
>> >>   
>> >> @@ -1820,7 +1824,8 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
>> >>    */
>> >>   enum {
>> >>   	EXT4_MF_MNTDIR_SAMPLED,
>> >> -	EXT4_MF_FC_INELIGIBLE	/* Fast commit ineligible */
>> >> +	EXT4_MF_FC_INELIGIBLE,	/* Fast commit ineligible */
>> >> +	EXT4_MF_ATOMIC_WRITE	/* Supports atomic write */
>> >
>> > Does this flag really buy us much?
>> >
>> 
>> I felt it is cleaner this way than comparing non-zero values of
>> fs_awu_min and fs_awu_max.
>
> What does it mean when MF_ATOMIC_WRITE is set and fs_awu_* are zero?
> The awu values don't change at runtime, so I think you can save yourself
> an atomic test by checking (non-atomically) for awu_min>0.

Sure. I agree with the reasoning that we can just check for awu_min > 0.
I can write an inline helper for this.

>
> (I don't know anything about the flags, those came after my time iirc.)
>

Thanks for the review :) 

-ritesh

> --D
>
>> Now that you pointed at it - Maybe a question for others who might have
>> the history of which one to use when - or do we think there is a scope
>> of merging the two into just one as a later cleanup?
>> 
>> I know that s_mount_flags was added for fastcommit and it needed the
>> state manipulations to be done in atomic way. Similarly s_ext4_flags
>> also was renamed from s_resize_flags for more general purpose use. Both
>> of these looks like could be merged isn't it?
>> 
>> 
>> 
>> >>   };
>> >>   
>> >>   static inline void ext4_set_mount_flag(struct super_block *sb, int bit)
>> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> >> index 54bdd4884fe6..897c028d5bc9 100644
>> >> --- a/fs/ext4/inode.c
>> >> +++ b/fs/ext4/inode.c
>> >> @@ -5578,6 +5578,20 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path,
>> >>   		}
>> >>   	}
>> >>   
>> >> +	if (S_ISREG(inode->i_mode) && (request_mask & STATX_WRITE_ATOMIC)) {
>> >> +		struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>> >> +		unsigned int awu_min, awu_max;
>> >> +
>> >> +		if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_ATOMIC_WRITE)) {
>> >
>> > I'd use ext4_inode_can_atomicwrite() here, similar to what is done for xfs
>> >
>> 
>> Sure since it is inode operation, we can check against ext4_inode_can_atomicwrite().
>> 
>> 
>> >> +			awu_min = sbi->fs_awu_min;
>> >> +			awu_max = sbi->fs_awu_max;
>> >> +		} else {
>> >> +			awu_min = awu_max = 0;
>> >> +		}
>> >> +
>> >> +		generic_fill_statx_atomic_writes(stat, awu_min, awu_max);
>> >> +	}
>> >> +
>> >>   	flags = ei->i_flags & EXT4_FL_USER_VISIBLE;
>> >>   	if (flags & EXT4_APPEND_FL)
>> >>   		stat->attributes |= STATX_ATTR_APPEND;
>> >> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> >> index 16a4ce704460..f5c075aff060 100644
>> >> --- a/fs/ext4/super.c
>> >> +++ b/fs/ext4/super.c
>> >> @@ -4425,6 +4425,37 @@ static int ext4_handle_clustersize(struct super_block *sb)
>> >>   	return 0;
>> >>   }
>> >>   
>> >> +/*
>> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/6] ext4: Check for atomic writes support in write iter
  2024-10-25 16:11       ` Darrick J. Wong
@ 2024-10-25 17:50         ` Ritesh Harjani
  0 siblings, 0 replies; 39+ messages in thread
From: Ritesh Harjani @ 2024-10-25 17:50 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Garry, linux-ext4, Theodore Ts'o, Jan Kara,
	Christoph Hellwig, Ojaswin Mujoo, Dave Chinner, linux-kernel,
	linux-xfs, linux-fsdevel

"Darrick J. Wong" <djwong@kernel.org> writes:

> On Fri, Oct 25, 2024 at 04:03:02PM +0530, Ritesh Harjani wrote:
>> John Garry <john.g.garry@oracle.com> writes:
>> 
>> > On 25/10/2024 04:45, Ritesh Harjani (IBM) wrote:
>> >> Let's validate using generic_atomic_write_valid() in
>> >> ext4_file_write_iter() if the write request has IOCB_ATOMIC set.
>> >> 
>> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> >> ---
>> >>   fs/ext4/file.c | 14 ++++++++++++++
>> >>   1 file changed, 14 insertions(+)
>> >> 
>> >> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> >> index f14aed14b9cf..b06c5d34bbd2 100644
>> >> --- a/fs/ext4/file.c
>> >> +++ b/fs/ext4/file.c
>> >> @@ -692,6 +692,20 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> >>   	if (IS_DAX(inode))
>> >>   		return ext4_dax_write_iter(iocb, from);
>> >>   #endif
>> >> +
>> >> +	if (iocb->ki_flags & IOCB_ATOMIC) {
>> >> +		size_t len = iov_iter_count(from);
>> >> +		int ret;
>> >> +
>> >> +		if (!IS_ALIGNED(len, EXT4_SB(inode->i_sb)->fs_awu_min) ||
>> >> +			len > EXT4_SB(inode->i_sb)->fs_awu_max)
>> >> +			return -EINVAL;
>> >
>> > this looks ok, but the IS_ALIGNED() check looks odd. I am not sure why 
>> > you don't just check that fs_awu_max >= len >= fs_awu_min
>> >
>> 
>> I guess this was just a stricter check. But we anyways have power_of_2
>> and other checks in generic_atomic_write_valid(). So it does not matter. 
>> 
>> I can change this in v2. 
>
> Also please fix the weird indenting in the if test:
>
> 		if (len < EXT4_SB(inode->i_sb)->fs_awu_min) ||
> 		    len > EXT4_SB(inode->i_sb)->fs_awu_max)
> 			return -EINVAL;
>
> --D

Got it!

-ritesh

>
>> Thanks!
>> 
>> >> +
>> >> +		ret = generic_atomic_write_valid(iocb, from);
>> >> +		if (ret)
>> >> +			return ret;
>> >> +	}
>> >> +
>> >>   	if (iocb->ki_flags & IOCB_DIRECT)
>> >>   		return ext4_dio_write_iter(iocb, from);
>> >>   	else
>> 
>> -ritesh
>> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes
  2024-10-25 16:16   ` Darrick J. Wong
@ 2024-10-25 17:51     ` Ritesh Harjani
  0 siblings, 0 replies; 39+ messages in thread
From: Ritesh Harjani @ 2024-10-25 17:51 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Christoph Hellwig,
	John Garry, Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel

"Darrick J. Wong" <djwong@kernel.org> writes:

> On Fri, Oct 25, 2024 at 09:15:53AM +0530, Ritesh Harjani (IBM) wrote:
>> iomap will not return -ENOTBLK in case of dio atomic writes. But let's
>> also add a WARN_ON_ONCE and return -EIO as a safety net.
>> 
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>  fs/ext4/file.c | 10 +++++++++-
>>  1 file changed, 9 insertions(+), 1 deletion(-)
>> 
>> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> index f9516121a036..af6ebd0ac0d6 100644
>> --- a/fs/ext4/file.c
>> +++ b/fs/ext4/file.c
>> @@ -576,8 +576,16 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>  		iomap_ops = &ext4_iomap_overwrite_ops;
>>  	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>>  			   dio_flags, NULL, 0);
>> -	if (ret == -ENOTBLK)
>> +	if (ret == -ENOTBLK) {
>>  		ret = 0;
>> +		/*
>> +		 * iomap will never return -ENOTBLK if write fails for atomic
>> +		 * write. But let's just add a safety net.
>
> I think it can if the pagecache invalidation fails, so you really do
> need the safety net.

Ah, right! So in that case I should remove WARN_ON_ONCE and correct
the comment too.

> I suspect that the xfs version of this series
> needs it too, though it may have fallen out?
>

I think so yes. Looks like it got missed.


> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>
> --D
>

Thanks for pointing that out.

-ritesh

>> +		 */
>> +		if (WARN_ON_ONCE(iocb->ki_flags & IOCB_ATOMIC))
>> +			ret = -EIO;
>> +	}
>> +
>>  	if (extend) {
>>  		/*
>>  		 * We always perform extending DIO write synchronously so by
>> -- 
>> 2.46.0
>> 
>> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes
  2024-10-25 14:13                     ` Ritesh Harjani
@ 2024-10-25 18:28                       ` Darrick J. Wong
  2024-10-26  4:35                         ` Ritesh Harjani
  0 siblings, 1 reply; 39+ messages in thread
From: Darrick J. Wong @ 2024-10-25 18:28 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: John Garry, linux-ext4, Theodore Ts'o, Jan Kara,
	Christoph Hellwig, Ojaswin Mujoo, Dave Chinner, linux-kernel,
	linux-xfs, linux-fsdevel

On Fri, Oct 25, 2024 at 07:43:17PM +0530, Ritesh Harjani wrote:
> John Garry <john.g.garry@oracle.com> writes:
> 
> > On 25/10/2024 13:36, Ritesh Harjani (IBM) wrote:
> >>>> So user will anyway will have to be made aware of not to
> >>>> attempt writes of fashion which can cause them such penalties.
> >>>>
> >>>> As patch-6 mentions this is a base support for bs = ps systems for
> >>>> enabling atomic writes using bigalloc. For now we return -EINVAL when we
> >>>> can't allocate a continuous user requested mapping which means it won't
> >>>> support operations of types 8k followed by 16k.
> >>>>
> >>> That's my least-preferred option.
> >>>
> >>> I think better would be reject atomic writes that cover unwritten
> >>> extents always - but that boat is about to sail...
> >> That's what this patch does.
> >
> > Not really.
> >
> > Currently we have 2x iomap restrictions:
> > a. mapping length must equal fs block size
> > b. bio created must equal total write size
> >
> > This patch just says that the mapping length must equal total write size 
> > (instead of a.). So quite similar to b.
> >
> >> For whatever reason if we couldn't allocate
> >> a single contiguous region of requested size for atomic write, then we
> >> reject the request always, isn't it. Or maybe I didn't understand your comment.
> >
> > As the simplest example, for an atomic write to an empty file, there 
> > should only be a single mapping returned to iomap_dio_bio_iter() and 
> > that would be of IOMAP_UNWRITTEN type. And we don't reject that.
> >
> 
> Ok. Maybe this is what I am missing. Could you please help me understand
> why should such writes be rejected? 
> 
> For e.g. 
> If FS could allocate a single contiguous IOMAP_UNWRITTEN extent of
> atomic write request size, that means - 
> 1. FS will allocate an unwritten extent.
> 2. will do writes (using submit_bio) to the unwritten extent. 
> 3. will do unwritten to written conversion. 
> 
> It is ok if either of the above operations fail right? If (3) fails
> then the region will still be marked unwritten that means it will read
> zero (old contents). (2) can anyway fail and will not result into
> partial writes. (1) will anyway not result into any write whatsoever.
> 
> So we can never have a situation where there is partial writes leading
> to mix of old and new write contents right for such cases? Which is what the
> requirement of atomic/untorn write also is?
> 
> Sorry am I missing something here?

I must be missing something; to perform an untorn write, two things must
happen --

1. The kernel writes the data to the storage device, and the storage
device either persists all of it, or throws back an error having
persisted none of it.

2. If (1) completes successfully, all file mapping updates for the range
written must be persisted, or an error is thrown back and none of them
are persisted.

iomap doesn't have to know how the filesystem satisfies (2); it just has
to create a single bio containing all data pages or it rejects the
write.

Currently, it's an implementation detail that the XFS directio write
ioend code processes the file mapping updates for the range written by
walking every extent mapping for that range and issuing separate
transactions for each mapping update.  There's nothing that can restart
the walk if it is interrupted.  That's why XFS cannot support multi
fsblock untorn writes to blocks with different status.

As I've said before, the most general solution to this would be to add a
new log intent item that would track the "update all mappings in this
file range" operation so that recovery could restart the walk.  This is
the most technically challenging, so we decided not to implement it
until there is demand.

Having set aside the idea of redesigning ioend, the second-most general
solution is pre-zeroing unwritten extents and holes so that
->iomap_begin implementations can present a single mapping to the bio
constructor.  Technically if there's only one unwritten extent or hole
or cow, xfs can actually satisfy (2) because it only creates one
transaction.

This gets me to the third and much less general solution -- only allow
untorn writes if we know that the ioend only ever has to run a single
transaction.  That's why untorn writes are limited to a single fsblock
for now -- it's a simple solution so that we can get our downstream
customers to kick the tires and start on the next iteration instead of
spending years on waterfalling.

Did you notice that in all of these cases, the capabilities of the
filesystem's ioend processing determines the restrictions on the number
and type of mappings that ->iomap_begin can give to iomap?

Now that we have a second system trying to hook up to the iomap support,
it's clear to me that the restrictions on mappings are specific to each
filesystem.  Therefore, the iomap directio code should not impose
restrictions on the mappings it receives unless they would prevent the
creation of the single aligned bio.

Instead, xfs_direct_write_iomap_begin and ext4_iomap_begin should return
EINVAL or something if they look at the file mappings and discover that
they cannot perform the ioend without risking torn mapping updates.  In
the long run, ->iomap_begin is where this iomap->len <= iter->len check
really belongs, but hold that thought.

For the multi fsblock case, the ->iomap_begin functions would have to
check that only one metadata update would be necessary in the ioend.
That's where things get murky, since ext4/xfs drop their mapping locks
between calls to ->iomap_begin.  So you'd have to check all the mappings
for unsupported mixed state every time.  Yuck.

It might be less gross to retain the restriction that iomap accepts only
one mapping for the entire file range, like Ritesh has here.  Users
might be ok with us saying that you can't do a 16k atomic write to a
region where you previously did an 8k write until you write the other
8k, even if someone has to write zeroes.  Users might be ok with the
kernel allowing multi-fsblock writes but only if the stars align.  But
to learn the answers to those questions, we have to put /something/ in
the hands of our users.

For now (because we're already at -rc5), let's have xfs/ext4's
->write_iter implementations restrict atomic writes to a single fsblock,
and get both merged into the kernel.  Let's defer the multi fsblock work
to 6.14, though I think we could take this patch.

Does that sound cool?

--D

> >> 
> >> If others prefer - we can maybe add such a check (e.g. ext4_dio_atomic_write_checks())
> >> for atomic writes in ext4_dio_write_checks(), similar to how we detect
> >> overwrites case to decide whether we need a read v/s write semaphore.
> >> So this can check if the user has a partially allocated extent for the
> >> user requested region and if yes, we can return -EINVAL from
> >> ext4_dio_write_iter() itself.
> >  > > I think this maybe better option than waiting until ->iomap_begin().
> >> This might also bring all atomic write constraints to be checked in one
> >> place i.e. during ext4_file_write_iter() itself.
> >
> > Something like this can be done once we decide how atomic writing to 
> > regions which cover mixed unwritten and written extents is to be handled.
> 
> Mixed extent regions (written + unwritten) is a different case all
> together (which can lead to mix of old and new contents).
> 
> 
> But here what I am suggesting is to add following constraint in case of
> ext4 with bigalloc - 
> 
> "Writes to a region which already has partially allocated extent is not supported."
> 
> That means we will return -EINVAL if we detect above case in
> ext4_file_write_iter() and sure we can document this behavior.
> 
> In retrospect, I am not sure why we cannot add a constraint for atomic
> writes (e.g. for ext4 bigalloc) and reject such writes outright,
> instead of silently incurring a performance penalty by zeroing out the
> partial regions by allowing such write request.
> 
> -ritesh
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes
  2024-10-25 18:28                       ` Darrick J. Wong
@ 2024-10-26  4:35                         ` Ritesh Harjani
  2024-10-31 21:36                           ` Darrick J. Wong
  0 siblings, 1 reply; 39+ messages in thread
From: Ritesh Harjani @ 2024-10-26  4:35 UTC (permalink / raw)
  To: Darrick J. Wong, Theodore Ts'o
  Cc: John Garry, linux-ext4, Jan Kara, Christoph Hellwig,
	Ojaswin Mujoo, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel

"Darrick J. Wong" <djwong@kernel.org> writes:

> On Fri, Oct 25, 2024 at 07:43:17PM +0530, Ritesh Harjani wrote:
>> John Garry <john.g.garry@oracle.com> writes:
>> 
>> > On 25/10/2024 13:36, Ritesh Harjani (IBM) wrote:
>> >>>> So user will anyway will have to be made aware of not to
>> >>>> attempt writes of fashion which can cause them such penalties.
>> >>>>
>> >>>> As patch-6 mentions this is a base support for bs = ps systems for
>> >>>> enabling atomic writes using bigalloc. For now we return -EINVAL when we
>> >>>> can't allocate a continuous user requested mapping which means it won't
>> >>>> support operations of types 8k followed by 16k.
>> >>>>
>> >>> That's my least-preferred option.
>> >>>
>> >>> I think better would be reject atomic writes that cover unwritten
>> >>> extents always - but that boat is about to sail...
>> >> That's what this patch does.
>> >
>> > Not really.
>> >
>> > Currently we have 2x iomap restrictions:
>> > a. mapping length must equal fs block size
>> > b. bio created must equal total write size
>> >
>> > This patch just says that the mapping length must equal total write size 
>> > (instead of a.). So quite similar to b.
>> >
>> >> For whatever reason if we couldn't allocate
>> >> a single contiguous region of requested size for atomic write, then we
>> >> reject the request always, isn't it. Or maybe I didn't understand your comment.
>> >
>> > As the simplest example, for an atomic write to an empty file, there 
>> > should only be a single mapping returned to iomap_dio_bio_iter() and 
>> > that would be of IOMAP_UNWRITTEN type. And we don't reject that.
>> >
>> 
>> Ok. Maybe this is what I am missing. Could you please help me understand
>> why should such writes be rejected? 
>> 
>> For e.g. 
>> If FS could allocate a single contiguous IOMAP_UNWRITTEN extent of
>> atomic write request size, that means - 
>> 1. FS will allocate an unwritten extent.
>> 2. will do writes (using submit_bio) to the unwritten extent. 
>> 3. will do unwritten to written conversion. 
>> 
>> It is ok if either of the above operations fail right? If (3) fails
>> then the region will still be marked unwritten that means it will read
>> zero (old contents). (2) can anyway fail and will not result into
>> partial writes. (1) will anyway not result into any write whatsoever.
>> 
>> So we can never have a situation where there is partial writes leading
>> to mix of old and new write contents right for such cases? Which is what the
>> requirement of atomic/untorn write also is?
>> 
>> Sorry am I missing something here?
>
> I must be missing something; to perform an untorn write, two things must
> happen --
>
> 1. The kernel writes the data to the storage device, and the storage
> device either persists all of it, or throws back an error having
> persisted none of it.
>
> 2. If (1) completes successfully, all file mapping updates for the range
> written must be persisted, or an error is thrown back and none of them
> are persisted.
>
> iomap doesn't have to know how the filesystem satisfies (2); it just has
> to create a single bio containing all data pages or it rejects the
> write.
>
> Currently, it's an implementation detail that the XFS directio write
> ioend code processes the file mapping updates for the range written by
> walking every extent mapping for that range and issuing separate
> transactions for each mapping update.  There's nothing that can restart
> the walk if it is interrupted.  That's why XFS cannot support multi
> fsblock untorn writes to blocks with different status.
>
> As I've said before, the most general solution to this would be to add a
> new log intent item that would track the "update all mappings in this
> file range" operation so that recovery could restart the walk.  This is
> the most technically challenging, so we decided not to implement it
> until there is demand.
>
> Having set aside the idea of redesigning ioend, the second-most general
> solution is pre-zeroing unwritten extents and holes so that
> ->iomap_begin implementations can present a single mapping to the bio
> constructor.  Technically if there's only one unwritten extent or hole
> or cow, xfs can actually satisfy (2) because it only creates one
> transaction.
>
> This gets me to the third and much less general solution -- only allow
> untorn writes if we know that the ioend only ever has to run a single
> transaction.  That's why untorn writes are limited to a single fsblock
> for now -- it's a simple solution so that we can get our downstream
> customers to kick the tires and start on the next iteration instead of
> spending years on waterfalling.
>
> Did you notice that in all of these cases, the capabilities of the
> filesystem's ioend processing determines the restrictions on the number
> and type of mappings that ->iomap_begin can give to iomap?
>
> Now that we have a second system trying to hook up to the iomap support,
> it's clear to me that the restrictions on mappings are specific to each
> filesystem.  Therefore, the iomap directio code should not impose
> restrictions on the mappings it receives unless they would prevent the
> creation of the single aligned bio.
>
> Instead, xfs_direct_write_iomap_begin and ext4_iomap_begin should return
> EINVAL or something if they look at the file mappings and discover that
> they cannot perform the ioend without risking torn mapping updates.  In
> the long run, ->iomap_begin is where this iomap->len <= iter->len check
> really belongs, but hold that thought.
>
> For the multi fsblock case, the ->iomap_begin functions would have to
> check that only one metadata update would be necessary in the ioend.
> That's where things get murky, since ext4/xfs drop their mapping locks
> between calls to ->iomap_begin.  So you'd have to check all the mappings
> for unsupported mixed state every time.  Yuck.
>

Thanks Darrick for taking time summarizing what all has been done
and your thoughts here.

> It might be less gross to retain the restriction that iomap accepts only
> one mapping for the entire file range, like Ritesh has here.

less gross :) sure. 

I would like to think of this as, being less restrictive (compared to
only allowing a single fsblock) by adding a constraint on the atomic
write I/O request i.e.  

"Atomic write I/O request to a region in a file is only allowed if that
region has no partially allocated extents. Otherwise, the file system
can fail the I/O operation by returning -EINVAL."

Essentially by adding this constraint to the I/O request, we are
helping the user to prevent atomic writes from accidentally getting
torned and also allowing multi-fsblock writes. So I still think that
might be the right thing to do here or at least a better start. FS can
later work on adding such support where we don't even need above
such constraint on a given atomic write I/O request.

> Users
> might be ok with us saying that you can't do a 16k atomic write to a
> region where you previously did an 8k write until you write the other
> 8k, even if someone has to write zeroes.  Users might be ok with the
> kernel allowing multi-fsblock writes but only if the stars align.

> But
> to learn the answers to those questions, we have to put /something/ in
> the hands of our users.

On this point, I think ext4 might already has those users who might be
using atomic write characteristics of devices to do untorn writes. e.g. 

In [1], Ted has talked about using bigalloc with ext4 for torn write
prevention. [2] talks about using ext4 with bigalloc to prevent torn
writes on aws cloud.

[1]: https://www.youtube.com/watch?v=gIeuiGg-_iw
[2]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configure-twp.html

My point being - Looks like the class of users who are using untorn
writes to improve their database performances are already doing so even
w/o any such interfaces being exposed to them (with ext4 bigalloc).

The current feature support of allowing atomic writes to only single
fsblock might not be helpful to these users who can provide that
feedback, who are using ext4 on bs = ps systems with bigalloc. But maybe
let's wait and hear from them whether it is ok if -   

"Atomic write I/O request to a region in a file is only allowed if that
region has no partially allocated extents. Otherwise, the file system
can fail the I/O operation by returning -EINVAL."

>
> For now (because we're already at -rc5), let's have xfs/ext4's
> ->write_iter implementations restrict atomic writes to a single fsblock,
> and get both merged into the kernel.

Yes, I agree with the approach. I agree that we should get a consensus
on this from folks.

Let me split this series up and address the review comments on patch
[1-4]. Patch-5 & 6 can be worked once we have conclusion on this and can
be eyed for 6.14.

> Let's defer the multi fsblock work
> to 6.14, though I think we could take this patch.

It's ok to consider this patch along with multi-fsblock work then i.e.
for 6.14.

>
> Does that sound cool?
>
> --D

Thanks Darrick :)

-ritesh

>> >> 
>> >> If others prefer - we can maybe add such a check (e.g. ext4_dio_atomic_write_checks())
>> >> for atomic writes in ext4_dio_write_checks(), similar to how we detect
>> >> overwrites case to decide whether we need a read v/s write semaphore.
>> >> So this can check if the user has a partially allocated extent for the
>> >> user requested region and if yes, we can return -EINVAL from
>> >> ext4_dio_write_iter() itself.
>> >  > > I think this maybe better option than waiting until ->iomap_begin().
>> >> This might also bring all atomic write constraints to be checked in one
>> >> place i.e. during ext4_file_write_iter() itself.
>> >
>> > Something like this can be done once we decide how atomic writing to 
>> > regions which cover mixed unwritten and written extents is to be handled.
>> 
>> Mixed extent regions (written + unwritten) is a different case all
>> together (which can lead to mix of old and new contents).
>> 
>> 
>> But here what I am suggesting is to add following constraint in case of
>> ext4 with bigalloc - 
>> 
>> "Writes to a region which already has partially allocated extent is not supported."
>> 
>> That means we will return -EINVAL if we detect above case in
>> ext4_file_write_iter() and sure we can document this behavior.
>> 
>> In retrospect, I am not sure why we cannot add a constraint for atomic
>> writes (e.g. for ext4 bigalloc) and reject such writes outright,
>> instead of silently incurring a performance penalty by zeroing out the
>> partial regions by allowing such write request.
>> 
>> -ritesh
>> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes
  2024-10-25  3:45 ` [PATCH 4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes Ritesh Harjani (IBM)
  2024-10-25 16:16   ` Darrick J. Wong
@ 2024-10-27 22:26   ` Dave Chinner
  2024-10-28  1:09     ` Ritesh Harjani
  1 sibling, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2024-10-27 22:26 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Darrick J . Wong,
	Christoph Hellwig, John Garry, Ojaswin Mujoo, linux-kernel,
	linux-xfs, linux-fsdevel

On Fri, Oct 25, 2024 at 09:15:53AM +0530, Ritesh Harjani (IBM) wrote:
> iomap will not return -ENOTBLK in case of dio atomic writes. But let's
> also add a WARN_ON_ONCE and return -EIO as a safety net.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/ext4/file.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index f9516121a036..af6ebd0ac0d6 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -576,8 +576,16 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		iomap_ops = &ext4_iomap_overwrite_ops;
>  	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>  			   dio_flags, NULL, 0);
> -	if (ret == -ENOTBLK)
> +	if (ret == -ENOTBLK) {
>  		ret = 0;
> +		/*
> +		 * iomap will never return -ENOTBLK if write fails for atomic
> +		 * write. But let's just add a safety net.
> +		 */
> +		if (WARN_ON_ONCE(iocb->ki_flags & IOCB_ATOMIC))
> +			ret = -EIO;
> +	}

Why can't the iomap code return EIO in this case for IOCB_ATOMIC?
That way we don't have to put this logic into every filesystem.

When/if we start supporting atomic writes for buffered IO, then it's
worth pushing this out to filesystems, but right now it doesn't seem
necessary...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes
  2024-10-27 22:26   ` Dave Chinner
@ 2024-10-28  1:09     ` Ritesh Harjani
  2024-10-28  5:26       ` Dave Chinner
  0 siblings, 1 reply; 39+ messages in thread
From: Ritesh Harjani @ 2024-10-28  1:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Darrick J . Wong,
	Christoph Hellwig, John Garry, Ojaswin Mujoo, linux-kernel,
	linux-xfs, linux-fsdevel


Hi Dave, 

Dave Chinner <david@fromorbit.com> writes:

> On Fri, Oct 25, 2024 at 09:15:53AM +0530, Ritesh Harjani (IBM) wrote:
>> iomap will not return -ENOTBLK in case of dio atomic writes. But let's
>> also add a WARN_ON_ONCE and return -EIO as a safety net.
>> 
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>  fs/ext4/file.c | 10 +++++++++-
>>  1 file changed, 9 insertions(+), 1 deletion(-)
>> 
>> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> index f9516121a036..af6ebd0ac0d6 100644
>> --- a/fs/ext4/file.c
>> +++ b/fs/ext4/file.c
>> @@ -576,8 +576,16 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>  		iomap_ops = &ext4_iomap_overwrite_ops;
>>  	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>>  			   dio_flags, NULL, 0);
>> -	if (ret == -ENOTBLK)
>> +	if (ret == -ENOTBLK) {
>>  		ret = 0;
>> +		/*
>> +		 * iomap will never return -ENOTBLK if write fails for atomic
>> +		 * write. But let's just add a safety net.
>> +		 */
>> +		if (WARN_ON_ONCE(iocb->ki_flags & IOCB_ATOMIC))
>> +			ret = -EIO;
>> +	}
>
> Why can't the iomap code return EIO in this case for IOCB_ATOMIC?
> That way we don't have to put this logic into every filesystem.

This was origially intended as a safety net hence the WARN_ON_ONCE.
Later Darrick pointed out that we still might have an unconverted
condition in iomap which can return ENOTBLK for DIO atomic writes (page
cache invalidation).

You pointed it right that it should be fixed in iomap. However do you
think filesystems can still keep this as safety net (maybe no need of
WARN_ON_ONCE).

>
> When/if we start supporting atomic writes for buffered IO, then it's
> worth pushing this out to filesystems, but right now it doesn't seem
> necessary...
>
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes
  2024-10-28  1:09     ` Ritesh Harjani
@ 2024-10-28  5:26       ` Dave Chinner
  2024-10-28  8:43         ` Ritesh Harjani
  2024-10-28 18:14         ` Ritesh Harjani
  0 siblings, 2 replies; 39+ messages in thread
From: Dave Chinner @ 2024-10-28  5:26 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Darrick J . Wong,
	Christoph Hellwig, John Garry, Ojaswin Mujoo, linux-kernel,
	linux-xfs, linux-fsdevel

On Mon, Oct 28, 2024 at 06:39:36AM +0530, Ritesh Harjani wrote:
> 
> Hi Dave, 
> 
> Dave Chinner <david@fromorbit.com> writes:
> 
> > On Fri, Oct 25, 2024 at 09:15:53AM +0530, Ritesh Harjani (IBM) wrote:
> >> iomap will not return -ENOTBLK in case of dio atomic writes. But let's
> >> also add a WARN_ON_ONCE and return -EIO as a safety net.
> >> 
> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> >> ---
> >>  fs/ext4/file.c | 10 +++++++++-
> >>  1 file changed, 9 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> >> index f9516121a036..af6ebd0ac0d6 100644
> >> --- a/fs/ext4/file.c
> >> +++ b/fs/ext4/file.c
> >> @@ -576,8 +576,16 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >>  		iomap_ops = &ext4_iomap_overwrite_ops;
> >>  	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> >>  			   dio_flags, NULL, 0);
> >> -	if (ret == -ENOTBLK)
> >> +	if (ret == -ENOTBLK) {
> >>  		ret = 0;
> >> +		/*
> >> +		 * iomap will never return -ENOTBLK if write fails for atomic
> >> +		 * write. But let's just add a safety net.
> >> +		 */
> >> +		if (WARN_ON_ONCE(iocb->ki_flags & IOCB_ATOMIC))
> >> +			ret = -EIO;
> >> +	}
> >
> > Why can't the iomap code return EIO in this case for IOCB_ATOMIC?
> > That way we don't have to put this logic into every filesystem.
> 
> This was origially intended as a safety net hence the WARN_ON_ONCE.
> Later Darrick pointed out that we still might have an unconverted
> condition in iomap which can return ENOTBLK for DIO atomic writes (page
> cache invalidation).

Yes. That's my point - iomap knows that it's an atomic write, it
knows that invalidation failed, and it knows that there is no such
thing as buffered atomic writes. So there is no possible fallback
here, and it should be returning EIO in the page cache invalidation
failure case and not ENOTBLK.

> You pointed it right that it should be fixed in iomap. However do you
> think filesystems can still keep this as safety net (maybe no need of
> WARN_ON_ONCE).

I don't see any point in adding "impossible to hit" checks into
filesystems just in case some core infrastructure has a bug
introduced....

-Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes
  2024-10-28  5:26       ` Dave Chinner
@ 2024-10-28  8:43         ` Ritesh Harjani
  2024-10-28 18:14         ` Ritesh Harjani
  1 sibling, 0 replies; 39+ messages in thread
From: Ritesh Harjani @ 2024-10-28  8:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Darrick J . Wong,
	Christoph Hellwig, John Garry, Ojaswin Mujoo, linux-kernel,
	linux-xfs, linux-fsdevel

Dave Chinner <david@fromorbit.com> writes:

> On Mon, Oct 28, 2024 at 06:39:36AM +0530, Ritesh Harjani wrote:
>> 
>> Hi Dave, 
>> 
>> Dave Chinner <david@fromorbit.com> writes:
>> 
>> > On Fri, Oct 25, 2024 at 09:15:53AM +0530, Ritesh Harjani (IBM) wrote:
>> >> iomap will not return -ENOTBLK in case of dio atomic writes. But let's
>> >> also add a WARN_ON_ONCE and return -EIO as a safety net.
>> >> 
>> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> >> ---
>> >>  fs/ext4/file.c | 10 +++++++++-
>> >>  1 file changed, 9 insertions(+), 1 deletion(-)
>> >> 
>> >> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> >> index f9516121a036..af6ebd0ac0d6 100644
>> >> --- a/fs/ext4/file.c
>> >> +++ b/fs/ext4/file.c
>> >> @@ -576,8 +576,16 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> >>  		iomap_ops = &ext4_iomap_overwrite_ops;
>> >>  	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>> >>  			   dio_flags, NULL, 0);
>> >> -	if (ret == -ENOTBLK)
>> >> +	if (ret == -ENOTBLK) {
>> >>  		ret = 0;
>> >> +		/*
>> >> +		 * iomap will never return -ENOTBLK if write fails for atomic
>> >> +		 * write. But let's just add a safety net.
>> >> +		 */
>> >> +		if (WARN_ON_ONCE(iocb->ki_flags & IOCB_ATOMIC))
>> >> +			ret = -EIO;
>> >> +	}
>> >
>> > Why can't the iomap code return EIO in this case for IOCB_ATOMIC?
>> > That way we don't have to put this logic into every filesystem.
>> 
>> This was origially intended as a safety net hence the WARN_ON_ONCE.
>> Later Darrick pointed out that we still might have an unconverted
>> condition in iomap which can return ENOTBLK for DIO atomic writes (page
>> cache invalidation).
>
> Yes. That's my point - iomap knows that it's an atomic write, it
> knows that invalidation failed, and it knows that there is no such
> thing as buffered atomic writes. So there is no possible fallback
> here, and it should be returning EIO in the page cache invalidation
> failure case and not ENOTBLK.
>

Sorry my bad. I think I might have looked into a different version of
the code earlier. So the current patch from John already takes care of
the condition where if the page cache invalidation fails we don't return
-ENOTBLK [1]

[1]: https://lore.kernel.org/linux-xfs/Zxnp8bma2KrMDg5m@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com/T/#m3664bbe00287d98caa690bb04f51d0ef164f52b3

>> You pointed it right that it should be fixed in iomap. However do you
>> think filesystems can still keep this as safety net (maybe no need of
>> WARN_ON_ONCE).
>
> I don't see any point in adding "impossible to hit" checks into
> filesystems just in case some core infrastructure has a bug
> introduced....
>

So even though we have taken care of that case from page cache
invalidation code, however it can still happen if iomap_iter()
ever returns -ENOTBLK.  

e.g. 

    blk_start_plug(&plug);
	while ((ret = iomap_iter(&iomi, ops)) > 0) {
		iomi.processed = iomap_dio_iter(&iomi, dio);

		/*
		 * We can only poll for single bio I/Os.
		 */
		iocb->ki_flags &= ~IOCB_HIPRI;
	}

	blk_finish_plug(&plug);

	/*
	 * We only report that we've read data up to i_size.
	 * Revert iter to a state corresponding to that as some callers (such
	 * as the splice code) rely on it.
	 */
	if (iov_iter_rw(iter) == READ && iomi.pos >= dio->i_size)
		iov_iter_revert(iter, iomi.pos - dio->i_size);

	if (ret == -EFAULT && dio->size && (dio_flags & IOMAP_DIO_PARTIAL)) {
		if (!(iocb->ki_flags & IOCB_NOWAIT))
			wait_for_completion = true;
		ret = 0;
	}

	/* magic error code to fall back to buffered I/O */
	if (ret == -ENOTBLK) {
		wait_for_completion = true;
		ret = 0;
	}

Reviewing the code paths there is a lot of ping pongs between core iomap
and FS. So it's not just core iomap what we are talking about here.

So I am still inclined towards having that check in place as a safety net. 
However - let me take some time to review some of this code paths
please. I wanted to send this email mainly to mention the point that
page cache invalidation case is already taken care in iomap for atomic
writes, so there is no bug there. 

I will get back on rest of the cases after I have looked more closely at it.

> -Dave.
>
> -- 
> Dave Chinner
> david@fromorbit.com

Thanks for the review!
-ritesh


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes
  2024-10-28  5:26       ` Dave Chinner
  2024-10-28  8:43         ` Ritesh Harjani
@ 2024-10-28 18:14         ` Ritesh Harjani
  2024-10-29 22:29           ` Dave Chinner
  1 sibling, 1 reply; 39+ messages in thread
From: Ritesh Harjani @ 2024-10-28 18:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Darrick J . Wong,
	Christoph Hellwig, John Garry, Ojaswin Mujoo, linux-kernel,
	linux-xfs, linux-fsdevel


Hi Dave, 

Dave Chinner <david@fromorbit.com> writes:

> On Mon, Oct 28, 2024 at 06:39:36AM +0530, Ritesh Harjani wrote:
>> 
>> Hi Dave, 
>> 
>> Dave Chinner <david@fromorbit.com> writes:
>> 
>> > On Fri, Oct 25, 2024 at 09:15:53AM +0530, Ritesh Harjani (IBM) wrote:
>> >> iomap will not return -ENOTBLK in case of dio atomic writes. But let's
>> >> also add a WARN_ON_ONCE and return -EIO as a safety net.
>> >> 
>> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> >> ---
>> >>  fs/ext4/file.c | 10 +++++++++-
>> >>  1 file changed, 9 insertions(+), 1 deletion(-)
>> >> 
>> >> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> >> index f9516121a036..af6ebd0ac0d6 100644
>> >> --- a/fs/ext4/file.c
>> >> +++ b/fs/ext4/file.c
>> >> @@ -576,8 +576,16 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> >>  		iomap_ops = &ext4_iomap_overwrite_ops;
>> >>  	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>> >>  			   dio_flags, NULL, 0);
>> >> -	if (ret == -ENOTBLK)
>> >> +	if (ret == -ENOTBLK) {
>> >>  		ret = 0;
>> >> +		/*
>> >> +		 * iomap will never return -ENOTBLK if write fails for atomic
>> >> +		 * write. But let's just add a safety net.
>> >> +		 */
>> >> +		if (WARN_ON_ONCE(iocb->ki_flags & IOCB_ATOMIC))
>> >> +			ret = -EIO;
>> >> +	}
>> >
>> > Why can't the iomap code return EIO in this case for IOCB_ATOMIC?
>> > That way we don't have to put this logic into every filesystem.
>> 
>> This was origially intended as a safety net hence the WARN_ON_ONCE.
>> Later Darrick pointed out that we still might have an unconverted
>> condition in iomap which can return ENOTBLK for DIO atomic writes (page
>> cache invalidation).
>
> Yes. That's my point - iomap knows that it's an atomic write, it
> knows that invalidation failed, and it knows that there is no such
> thing as buffered atomic writes. So there is no possible fallback
> here, and it should be returning EIO in the page cache invalidation
> failure case and not ENOTBLK.
>

So the iomap DIO can return following as return values which can make
some filesystems fallback to buffered-io (if they implement fallback
logic) - 
(1) -ENOTBLK -> this is only returned for pagecache invalidation failure.
(2) 0 or partial write size -> This can never happen for atomic writes
(since we are only allowing for single fsblock as of now).

Now looking at XFS, it never fallsback to buffered-io ever except just 2
cases - 
1. When pagecache invalidation fails in iomap (can never happen for
atomic writes)
2. On unaligned DIO writes to reflinked CoW (not possible for atomic writes)

So it anyways should never happen that XFS ever fallback to buffered-io
for DIO atomic writes. Even today it does not fallback to buffered-io
for non-atomic short DIO writes.

>> You pointed it right that it should be fixed in iomap. However do you
>> think filesystems can still keep this as safety net (maybe no need of
>> WARN_ON_ONCE).
>
> I don't see any point in adding "impossible to hit" checks into
> filesystems just in case some core infrastructure has a bug
> introduced....

Yes, that is true for XFS. EXT4 however can return -ENOTBLK for short
writes, though it should not happen for current atomic write case where
we are only allowing for 1 fsblock. 

However given there are several places in EXT4 which has got fallback logic,
I would still like to go with a WARN_ON_ONCE where we are calling ext4
buffered-io handling for DIO fallback writes. This is to catch any bugs
even in future when we move to multi-fsblock case (until we have atomic
write support for buffered-io).

Thanks!
-ritesh

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes
  2024-10-28 18:14         ` Ritesh Harjani
@ 2024-10-29 22:29           ` Dave Chinner
  2024-10-29 23:51             ` Ritesh Harjani
  0 siblings, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2024-10-29 22:29 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Darrick J . Wong,
	Christoph Hellwig, John Garry, Ojaswin Mujoo, linux-kernel,
	linux-xfs, linux-fsdevel

On Mon, Oct 28, 2024 at 11:44:00PM +0530, Ritesh Harjani wrote:
> 
> Hi Dave, 
> 
> Dave Chinner <david@fromorbit.com> writes:
> 
> > On Mon, Oct 28, 2024 at 06:39:36AM +0530, Ritesh Harjani wrote:
> >> 
> >> Hi Dave, 
> >> 
> >> Dave Chinner <david@fromorbit.com> writes:
> >> 
> >> > On Fri, Oct 25, 2024 at 09:15:53AM +0530, Ritesh Harjani (IBM) wrote:
> >> >> iomap will not return -ENOTBLK in case of dio atomic writes. But let's
> >> >> also add a WARN_ON_ONCE and return -EIO as a safety net.
> >> >> 
> >> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> >> >> ---
> >> >>  fs/ext4/file.c | 10 +++++++++-
> >> >>  1 file changed, 9 insertions(+), 1 deletion(-)
> >> >> 
> >> >> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> >> >> index f9516121a036..af6ebd0ac0d6 100644
> >> >> --- a/fs/ext4/file.c
> >> >> +++ b/fs/ext4/file.c
> >> >> @@ -576,8 +576,16 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >> >>  		iomap_ops = &ext4_iomap_overwrite_ops;
> >> >>  	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> >> >>  			   dio_flags, NULL, 0);
> >> >> -	if (ret == -ENOTBLK)
> >> >> +	if (ret == -ENOTBLK) {
> >> >>  		ret = 0;
> >> >> +		/*
> >> >> +		 * iomap will never return -ENOTBLK if write fails for atomic
> >> >> +		 * write. But let's just add a safety net.
> >> >> +		 */
> >> >> +		if (WARN_ON_ONCE(iocb->ki_flags & IOCB_ATOMIC))
> >> >> +			ret = -EIO;
> >> >> +	}
> >> >
> >> > Why can't the iomap code return EIO in this case for IOCB_ATOMIC?
> >> > That way we don't have to put this logic into every filesystem.
> >> 
> >> This was origially intended as a safety net hence the WARN_ON_ONCE.
> >> Later Darrick pointed out that we still might have an unconverted
> >> condition in iomap which can return ENOTBLK for DIO atomic writes (page
> >> cache invalidation).
> >
> > Yes. That's my point - iomap knows that it's an atomic write, it
> > knows that invalidation failed, and it knows that there is no such
> > thing as buffered atomic writes. So there is no possible fallback
> > here, and it should be returning EIO in the page cache invalidation
> > failure case and not ENOTBLK.
> >
> 
> So the iomap DIO can return following as return values which can make
> some filesystems fallback to buffered-io (if they implement fallback
> logic) - 
> (1) -ENOTBLK -> this is only returned for pagecache invalidation failure.
> (2) 0 or partial write size -> This can never happen for atomic writes
> (since we are only allowing for single fsblock as of now).

Even when we allow multi-FSB atomic writes, the definition of
atomic write is still "all or nothing". There is no scope for "short
writes" when IOCB_ATOMIC is set - any condition that means we can't
write the entire IO as a single bio, we need to abort and return
EINVAL.

Hence -ENOTBLK should never be returned by iomap for atomic DIO
writes - we need to say -EINVAL if the write could not be issued
atomically for whatever reason it may be so the application knows
that atomic IO submission was not possible for that IO.

> Now looking at XFS, it never fallsback to buffered-io ever except just 2
> cases - 
> 1. When pagecache invalidation fails in iomap (can never happen for
> atomic writes)

Why can't this happen for atomic DIO writes?  It's the same failure
cases as for normal DIO writes, isn't it? (i.e. race with mmap
writes)

My point is that if it's an atomic write, this failure should get
turned into -EINVAL by the iomap code. We do not want a fallback to
buffered IO when this situation happens for atomic IO.

> 2. On unaligned DIO writes to reflinked CoW (not possible for atomic writes)

This path doesn't ever go through iomap - XFS catches that case
before it calls into iomap, so it's not relevant to how iomap
behaves w.r.t atomic IO.

> So it anyways should never happen that XFS ever fallback to buffered-io
> for DIO atomic writes. Even today it does not fallback to buffered-io
> for non-atomic short DIO writes.
> 
> >> You pointed it right that it should be fixed in iomap. However do you
> >> think filesystems can still keep this as safety net (maybe no need of
> >> WARN_ON_ONCE).
> >
> > I don't see any point in adding "impossible to hit" checks into
> > filesystems just in case some core infrastructure has a bug
> > introduced....
> 
> Yes, that is true for XFS. EXT4 however can return -ENOTBLK for short
> writes, though it should not happen for current atomic write case where
> we are only allowing for 1 fsblock. 

Yes, but the -ENOTBLK error returned from ext4_iomap_end() if
nothing was written does not get returned to ext4 from
__iomap_dio_rw(). It is consumed by the iomap code:

	/* magic error code to fall back to buffered I/O */
        if (ret == -ENOTBLK) {
                wait_for_completion = true;
                ret = 0;
	}

This means that all the IO that was issued gets completed before
returning to the caller and that's how the short write comes about.

-ENOTBLK is *not returned to the caller* on a short write -
iomap_dio_rw will return 0 (success).  The caller then has to look
at the iov_iter state to determine if the write was fully completed.
This is exactly what the ext4 code currently does for all DIO
writes, not just those that return -ENOTBLK.

> I would still like to go with a WARN_ON_ONCE where we are calling ext4
> buffered-io handling for DIO fallback writes. This is to catch any bugs
> even in future when we move to multi-fsblock case (until we have atomic
> write support for buffered-io).

Your choice, but please realise that it is not going to catch short
atomic writes at all.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes
  2024-10-29 22:29           ` Dave Chinner
@ 2024-10-29 23:51             ` Ritesh Harjani
  0 siblings, 0 replies; 39+ messages in thread
From: Ritesh Harjani @ 2024-10-29 23:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Darrick J . Wong,
	Christoph Hellwig, John Garry, Ojaswin Mujoo, linux-kernel,
	linux-xfs, linux-fsdevel


Hi Dave, 

Dave Chinner <david@fromorbit.com> writes:

> On Mon, Oct 28, 2024 at 11:44:00PM +0530, Ritesh Harjani wrote:
>> 
>> Hi Dave, 
>> 
>> Dave Chinner <david@fromorbit.com> writes:
>> 
>> > On Mon, Oct 28, 2024 at 06:39:36AM +0530, Ritesh Harjani wrote:
>> >> 
>> >> Hi Dave, 
>> >> 
>> >> Dave Chinner <david@fromorbit.com> writes:
>> >> 
>> >> > On Fri, Oct 25, 2024 at 09:15:53AM +0530, Ritesh Harjani (IBM) wrote:
>> >> >> iomap will not return -ENOTBLK in case of dio atomic writes. But let's
>> >> >> also add a WARN_ON_ONCE and return -EIO as a safety net.
>> >> >> 
>> >> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> >> >> ---
>> >> >>  fs/ext4/file.c | 10 +++++++++-
>> >> >>  1 file changed, 9 insertions(+), 1 deletion(-)
>> >> >> 
>> >> >> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> >> >> index f9516121a036..af6ebd0ac0d6 100644
>> >> >> --- a/fs/ext4/file.c
>> >> >> +++ b/fs/ext4/file.c
>> >> >> @@ -576,8 +576,16 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> >> >>  		iomap_ops = &ext4_iomap_overwrite_ops;
>> >> >>  	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>> >> >>  			   dio_flags, NULL, 0);
>> >> >> -	if (ret == -ENOTBLK)
>> >> >> +	if (ret == -ENOTBLK) {
>> >> >>  		ret = 0;
>> >> >> +		/*
>> >> >> +		 * iomap will never return -ENOTBLK if write fails for atomic
>> >> >> +		 * write. But let's just add a safety net.
>> >> >> +		 */
>> >> >> +		if (WARN_ON_ONCE(iocb->ki_flags & IOCB_ATOMIC))
>> >> >> +			ret = -EIO;
>> >> >> +	}
>> >> >
>> >> > Why can't the iomap code return EIO in this case for IOCB_ATOMIC?
>> >> > That way we don't have to put this logic into every filesystem.
>> >> 
>> >> This was origially intended as a safety net hence the WARN_ON_ONCE.
>> >> Later Darrick pointed out that we still might have an unconverted
>> >> condition in iomap which can return ENOTBLK for DIO atomic writes (page
>> >> cache invalidation).
>> >
>> > Yes. That's my point - iomap knows that it's an atomic write, it
>> > knows that invalidation failed, and it knows that there is no such
>> > thing as buffered atomic writes. So there is no possible fallback
>> > here, and it should be returning EIO in the page cache invalidation
>> > failure case and not ENOTBLK.
>> >
>> 
>> So the iomap DIO can return following as return values which can make
>> some filesystems fallback to buffered-io (if they implement fallback
>> logic) - 
>> (1) -ENOTBLK -> this is only returned for pagecache invalidation failure.
>> (2) 0 or partial write size -> This can never happen for atomic writes
>> (since we are only allowing for single fsblock as of now).
>
> Even when we allow multi-FSB atomic writes, the definition of
> atomic write is still "all or nothing". There is no scope for "short
> writes" when IOCB_ATOMIC is set - any condition that means we can't
> write the entire IO as a single bio, we need to abort and return
> EINVAL.

yes. As long as it is a single bio, I agree even the short write
condition should not hit based on the current iomap code.

>
> Hence -ENOTBLK should never be returned by iomap for atomic DIO
> writes - we need to say -EINVAL if the write could not be issued
> atomically for whatever reason it may be so the application knows
> that atomic IO submission was not possible for that IO.
>

Agreed Dave. That is what iomap is doing today for atomic write code. 
(Except maybe one minor difference where it returns -EAGAIN in case of
page cache invalidation assuming the failure maybe transient and the
request could be tried again).


	
>> Now looking at XFS, it never fallsback to buffered-io ever except just 2
>> cases - 
>> 1. When pagecache invalidation fails in iomap (can never happen for
>> atomic writes)
>
> Why can't this happen for atomic DIO writes?  It's the same failure
> cases as for normal DIO writes, isn't it? (i.e. race with mmap
> writes)
>

I meant after the patch which adds atomic write support in iomap code
from John, make sure we don't return -ENOTBLK in case of atomic write request. 


> My point is that if it's an atomic write, this failure should get
> turned into -EINVAL by the iomap code. We do not want a fallback to
> buffered IO when this situation happens for atomic IO.
>
>> 2. On unaligned DIO writes to reflinked CoW (not possible for atomic writes)
>
> This path doesn't ever go through iomap - XFS catches that case
> before it calls into iomap, so it's not relevant to how iomap
> behaves w.r.t atomic IO.
>

Right.

>> So it anyways should never happen that XFS ever fallback to buffered-io
>> for DIO atomic writes. Even today it does not fallback to buffered-io
>> for non-atomic short DIO writes.
>> 
>> >> You pointed it right that it should be fixed in iomap. However do you
>> >> think filesystems can still keep this as safety net (maybe no need of
>> >> WARN_ON_ONCE).
>> >
>> > I don't see any point in adding "impossible to hit" checks into
>> > filesystems just in case some core infrastructure has a bug
>> > introduced....
>> 
>> Yes, that is true for XFS. EXT4 however can return -ENOTBLK for short
>> writes, though it should not happen for current atomic write case where
>> we are only allowing for 1 fsblock. 
>
> Yes, but the -ENOTBLK error returned from ext4_iomap_end() if
> nothing was written does not get returned to ext4 from
> __iomap_dio_rw(). It is consumed by the iomap code:
>
> 	/* magic error code to fall back to buffered I/O */
>         if (ret == -ENOTBLK) {
>                 wait_for_completion = true;
>                 ret = 0;
> 	}
>
> This means that all the IO that was issued gets completed before
> returning to the caller and that's how the short write comes about.
>
> -ENOTBLK is *not returned to the caller* on a short write -

yes. That's my understanding too of the short write case handling in
iomap.

> iomap_dio_rw will return 0 (success).  The caller then has to look
> at the iov_iter state to determine if the write was fully completed.
> This is exactly what the ext4 code currently does for all DIO
> writes, not just those that return -ENOTBLK.
>

yes. Agreed.

>> I would still like to go with a WARN_ON_ONCE where we are calling ext4
>> buffered-io handling for DIO fallback writes. This is to catch any bugs
>> even in future when we move to multi-fsblock case (until we have atomic
>> write support for buffered-io).
>
> Your choice, but please realise that it is not going to catch short
> atomic writes at all.
>

Thanks Dave. Yes, I would like to maybe keep a WARN_ON_ONCE since ext4
has a fallback handling logic where a short DIO or -ENOTBLK case could
be later handled by buffered-io logic (though I agree iomap won't let it
happen for atomic write case). 

But a WARN_ON_ONCE just before buffered-io fallback handling logic in
ext4 DIO path would be my preferred choice only to make sure we could
catch any unwanted bugs in future too.

So I was thinking of this change instead - 


diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 8116bd78910b..61787a37e9d4 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -599,6 +599,13 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
                ssize_t err;
                loff_t endbyte;

+               /*
+                * There is no support for atomic writes on buffered-io yet,
+                * we should never fallback to buffered-io for DIO atomic
+                * writes.
+                */
+               WARN_ON_ONCE(iocb->ki_flags & IOCB_ATOMIC);
+
                offset = iocb->ki_pos;
                err = ext4_buffered_write_iter(iocb, from);
                if (err < 0)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index fcdee27b9aa2..26b3c84d7f64 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3449,12 +3449,16 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 {
        /*
         * Check to see whether an error occurred while writing out the data to
-        * the allocated blocks. If so, return the magic error code so that we
-        * fallback to buffered I/O and attempt to complete the remainder of
-        * the I/O. Any blocks that may have been allocated in preparation for
-        * the direct I/O will be reused during buffered I/O.
+        * the allocated blocks. If so, return the magic error code for
+        * non-atomic write so that we fallback to buffered I/O and attempt to
+        * complete the remainder of the I/O.
+        * For atomic writes we will simply fail the I/O request if we coudn't
+        * write anything. For non-atomic writes, any blocks that may have been
+        * allocated in preparation for the direct I/O will be reused during
+        * buffered I/O.
         */
-       if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
+       if (!(flags & IOMAP_ATOMIC) && (flags & (IOMAP_WRITE | IOMAP_DIRECT))
+                       && written == 0)
                return -ENOTBLK;

        return 0;


> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Thanks a lot for the review!

-ritesh

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes
  2024-10-26  4:35                         ` Ritesh Harjani
@ 2024-10-31 21:36                           ` Darrick J. Wong
  2024-11-04  1:52                             ` Dave Chinner
  0 siblings, 1 reply; 39+ messages in thread
From: Darrick J. Wong @ 2024-10-31 21:36 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Theodore Ts'o, John Garry, linux-ext4, Jan Kara,
	Christoph Hellwig, Ojaswin Mujoo, Dave Chinner, linux-kernel,
	linux-xfs, linux-fsdevel

On Sat, Oct 26, 2024 at 10:05:44AM +0530, Ritesh Harjani wrote:
> "Darrick J. Wong" <djwong@kernel.org> writes:
> 
> > On Fri, Oct 25, 2024 at 07:43:17PM +0530, Ritesh Harjani wrote:
> >> John Garry <john.g.garry@oracle.com> writes:
> >> 
> >> > On 25/10/2024 13:36, Ritesh Harjani (IBM) wrote:
> >> >>>> So user will anyway will have to be made aware of not to
> >> >>>> attempt writes of fashion which can cause them such penalties.
> >> >>>>
> >> >>>> As patch-6 mentions this is a base support for bs = ps systems for
> >> >>>> enabling atomic writes using bigalloc. For now we return -EINVAL when we
> >> >>>> can't allocate a continuous user requested mapping which means it won't
> >> >>>> support operations of types 8k followed by 16k.
> >> >>>>
> >> >>> That's my least-preferred option.
> >> >>>
> >> >>> I think better would be reject atomic writes that cover unwritten
> >> >>> extents always - but that boat is about to sail...
> >> >> That's what this patch does.
> >> >
> >> > Not really.
> >> >
> >> > Currently we have 2x iomap restrictions:
> >> > a. mapping length must equal fs block size
> >> > b. bio created must equal total write size
> >> >
> >> > This patch just says that the mapping length must equal total write size 
> >> > (instead of a.). So quite similar to b.
> >> >
> >> >> For whatever reason if we couldn't allocate
> >> >> a single contiguous region of requested size for atomic write, then we
> >> >> reject the request always, isn't it. Or maybe I didn't understand your comment.
> >> >
> >> > As the simplest example, for an atomic write to an empty file, there 
> >> > should only be a single mapping returned to iomap_dio_bio_iter() and 
> >> > that would be of IOMAP_UNWRITTEN type. And we don't reject that.
> >> >
> >> 
> >> Ok. Maybe this is what I am missing. Could you please help me understand
> >> why should such writes be rejected? 
> >> 
> >> For e.g. 
> >> If FS could allocate a single contiguous IOMAP_UNWRITTEN extent of
> >> atomic write request size, that means - 
> >> 1. FS will allocate an unwritten extent.
> >> 2. will do writes (using submit_bio) to the unwritten extent. 
> >> 3. will do unwritten to written conversion. 
> >> 
> >> It is ok if either of the above operations fail right? If (3) fails
> >> then the region will still be marked unwritten that means it will read
> >> zero (old contents). (2) can anyway fail and will not result into
> >> partial writes. (1) will anyway not result into any write whatsoever.
> >> 
> >> So we can never have a situation where there is partial writes leading
> >> to mix of old and new write contents right for such cases? Which is what the
> >> requirement of atomic/untorn write also is?
> >> 
> >> Sorry am I missing something here?
> >
> > I must be missing something; to perform an untorn write, two things must
> > happen --
> >
> > 1. The kernel writes the data to the storage device, and the storage
> > device either persists all of it, or throws back an error having
> > persisted none of it.
> >
> > 2. If (1) completes successfully, all file mapping updates for the range
> > written must be persisted, or an error is thrown back and none of them
> > are persisted.
> >
> > iomap doesn't have to know how the filesystem satisfies (2); it just has
> > to create a single bio containing all data pages or it rejects the
> > write.
> >
> > Currently, it's an implementation detail that the XFS directio write
> > ioend code processes the file mapping updates for the range written by
> > walking every extent mapping for that range and issuing separate
> > transactions for each mapping update.  There's nothing that can restart
> > the walk if it is interrupted.  That's why XFS cannot support multi
> > fsblock untorn writes to blocks with different status.
> >
> > As I've said before, the most general solution to this would be to add a
> > new log intent item that would track the "update all mappings in this
> > file range" operation so that recovery could restart the walk.  This is
> > the most technically challenging, so we decided not to implement it
> > until there is demand.
> >
> > Having set aside the idea of redesigning ioend, the second-most general
> > solution is pre-zeroing unwritten extents and holes so that
> > ->iomap_begin implementations can present a single mapping to the bio
> > constructor.  Technically if there's only one unwritten extent or hole
> > or cow, xfs can actually satisfy (2) because it only creates one
> > transaction.
> >
> > This gets me to the third and much less general solution -- only allow
> > untorn writes if we know that the ioend only ever has to run a single
> > transaction.  That's why untorn writes are limited to a single fsblock
> > for now -- it's a simple solution so that we can get our downstream
> > customers to kick the tires and start on the next iteration instead of
> > spending years on waterfalling.
> >
> > Did you notice that in all of these cases, the capabilities of the
> > filesystem's ioend processing determines the restrictions on the number
> > and type of mappings that ->iomap_begin can give to iomap?
> >
> > Now that we have a second system trying to hook up to the iomap support,
> > it's clear to me that the restrictions on mappings are specific to each
> > filesystem.  Therefore, the iomap directio code should not impose
> > restrictions on the mappings it receives unless they would prevent the
> > creation of the single aligned bio.
> >
> > Instead, xfs_direct_write_iomap_begin and ext4_iomap_begin should return
> > EINVAL or something if they look at the file mappings and discover that
> > they cannot perform the ioend without risking torn mapping updates.  In
> > the long run, ->iomap_begin is where this iomap->len <= iter->len check
> > really belongs, but hold that thought.
> >
> > For the multi fsblock case, the ->iomap_begin functions would have to
> > check that only one metadata update would be necessary in the ioend.
> > That's where things get murky, since ext4/xfs drop their mapping locks
> > between calls to ->iomap_begin.  So you'd have to check all the mappings
> > for unsupported mixed state every time.  Yuck.
> >
> 
> Thanks Darrick for taking time summarizing what all has been done
> and your thoughts here.
> 
> > It might be less gross to retain the restriction that iomap accepts only
> > one mapping for the entire file range, like Ritesh has here.
> 
> less gross :) sure. 
> 
> I would like to think of this as, being less restrictive (compared to
> only allowing a single fsblock) by adding a constraint on the atomic
> write I/O request i.e.  
> 
> "Atomic write I/O request to a region in a file is only allowed if that
> region has no partially allocated extents. Otherwise, the file system
> can fail the I/O operation by returning -EINVAL."
> 
> Essentially by adding this constraint to the I/O request, we are
> helping the user to prevent atomic writes from accidentally getting
> torned and also allowing multi-fsblock writes. So I still think that
> might be the right thing to do here or at least a better start. FS can
> later work on adding such support where we don't even need above
> such constraint on a given atomic write I/O request.

On today's ext4 call, Ted and Ritesh and I realized that there's a bit
more to it than this -- it's not possible to support untorn writes to a
mix of written/(cow,unwritten) mappings even if they all point to the
same physical space.  If the system fails after the storage device
commits the write but before any of the ioend processing is scheduled, a
subsequent read of the previously written blocks will produce the new
data, but reads to the other areas will produce the old contents (or
zeroes, or whatever).  That's a torn write.

Therefore, iomap ought to stick to requiring that ->iomap_begin returns
a single iomap to cover the entire file range for the untorn write.  For
an unwritten extent, the post-recovery read will see either zeroes or
the new contents; for a single-mapping COW it'll see old or new contents
but not both.

(Obviously this still requires that the fs can perform the mapping
updates without tearing too.)

--D

> > Users
> > might be ok with us saying that you can't do a 16k atomic write to a
> > region where you previously did an 8k write until you write the other
> > 8k, even if someone has to write zeroes.  Users might be ok with the
> > kernel allowing multi-fsblock writes but only if the stars align.
> 
> > But
> > to learn the answers to those questions, we have to put /something/ in
> > the hands of our users.
> 
> On this point, I think ext4 might already has those users who might be
> using atomic write characteristics of devices to do untorn writes. e.g. 
> 
> In [1], Ted has talked about using bigalloc with ext4 for torn write
> prevention. [2] talks about using ext4 with bigalloc to prevent torn
> writes on aws cloud.
> 
> [1]: https://www.youtube.com/watch?v=gIeuiGg-_iw
> [2]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configure-twp.html
> 
> My point being - Looks like the class of users who are using untorn
> writes to improve their database performances are already doing so even
> w/o any such interfaces being exposed to them (with ext4 bigalloc).
> 
> The current feature support of allowing atomic writes to only single
> fsblock might not be helpful to these users who can provide that
> feedback, who are using ext4 on bs = ps systems with bigalloc. But maybe
> let's wait and hear from them whether it is ok if -   
> 
> "Atomic write I/O request to a region in a file is only allowed if that
> region has no partially allocated extents. Otherwise, the file system
> can fail the I/O operation by returning -EINVAL."
> 
> >
> > For now (because we're already at -rc5), let's have xfs/ext4's
> > ->write_iter implementations restrict atomic writes to a single fsblock,
> > and get both merged into the kernel.
> 
> Yes, I agree with the approach. I agree that we should get a consensus
> on this from folks.
> 
> Let me split this series up and address the review comments on patch
> [1-4]. Patch-5 & 6 can be worked once we have conclusion on this and can
> be eyed for 6.14.
> 
> > Let's defer the multi fsblock work
> > to 6.14, though I think we could take this patch.
> 
> It's ok to consider this patch along with multi-fsblock work then i.e.
> for 6.14.
> 
> >
> > Does that sound cool?
> >
> > --D
> 
> Thanks Darrick :)
> 
> -ritesh
> 
> >> >> 
> >> >> If others prefer - we can maybe add such a check (e.g. ext4_dio_atomic_write_checks())
> >> >> for atomic writes in ext4_dio_write_checks(), similar to how we detect
> >> >> overwrites case to decide whether we need a read v/s write semaphore.
> >> >> So this can check if the user has a partially allocated extent for the
> >> >> user requested region and if yes, we can return -EINVAL from
> >> >> ext4_dio_write_iter() itself.
> >> >  > > I think this maybe better option than waiting until ->iomap_begin().
> >> >> This might also bring all atomic write constraints to be checked in one
> >> >> place i.e. during ext4_file_write_iter() itself.
> >> >
> >> > Something like this can be done once we decide how atomic writing to 
> >> > regions which cover mixed unwritten and written extents is to be handled.
> >> 
> >> Mixed extent regions (written + unwritten) is a different case all
> >> together (which can lead to mix of old and new contents).
> >> 
> >> 
> >> But here what I am suggesting is to add following constraint in case of
> >> ext4 with bigalloc - 
> >> 
> >> "Writes to a region which already has partially allocated extent is not supported."
> >> 
> >> That means we will return -EINVAL if we detect above case in
> >> ext4_file_write_iter() and sure we can document this behavior.
> >> 
> >> In retrospect, I am not sure why we cannot add a constraint for atomic
> >> writes (e.g. for ext4 bigalloc) and reject such writes outright,
> >> instead of silently incurring a performance penalty by zeroing out the
> >> partial regions by allowing such write request.
> >> 
> >> -ritesh
> >> 
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes
  2024-10-31 21:36                           ` Darrick J. Wong
@ 2024-11-04  1:52                             ` Dave Chinner
  2024-11-05  0:09                               ` Darrick J. Wong
  0 siblings, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2024-11-04  1:52 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Ritesh Harjani, Theodore Ts'o, John Garry, linux-ext4,
	Jan Kara, Christoph Hellwig, Ojaswin Mujoo, linux-kernel,
	linux-xfs, linux-fsdevel

On Thu, Oct 31, 2024 at 02:36:40PM -0700, Darrick J. Wong wrote:
> On Sat, Oct 26, 2024 at 10:05:44AM +0530, Ritesh Harjani wrote:
> > > This gets me to the third and much less general solution -- only allow
> > > untorn writes if we know that the ioend only ever has to run a single
> > > transaction.  That's why untorn writes are limited to a single fsblock
> > > for now -- it's a simple solution so that we can get our downstream
> > > customers to kick the tires and start on the next iteration instead of
> > > spending years on waterfalling.
> > >
> > > Did you notice that in all of these cases, the capabilities of the
> > > filesystem's ioend processing determines the restrictions on the number
> > > and type of mappings that ->iomap_begin can give to iomap?
> > >
> > > Now that we have a second system trying to hook up to the iomap support,
> > > it's clear to me that the restrictions on mappings are specific to each
> > > filesystem.  Therefore, the iomap directio code should not impose
> > > restrictions on the mappings it receives unless they would prevent the
> > > creation of the single aligned bio.
> > >
> > > Instead, xfs_direct_write_iomap_begin and ext4_iomap_begin should return
> > > EINVAL or something if they look at the file mappings and discover that
> > > they cannot perform the ioend without risking torn mapping updates.  In
> > > the long run, ->iomap_begin is where this iomap->len <= iter->len check
> > > really belongs, but hold that thought.
> > >
> > > For the multi fsblock case, the ->iomap_begin functions would have to
> > > check that only one metadata update would be necessary in the ioend.
> > > That's where things get murky, since ext4/xfs drop their mapping locks
> > > between calls to ->iomap_begin.  So you'd have to check all the mappings
> > > for unsupported mixed state every time.  Yuck.
> > >
> > 
> > Thanks Darrick for taking time summarizing what all has been done
> > and your thoughts here.
> > 
> > > It might be less gross to retain the restriction that iomap accepts only
> > > one mapping for the entire file range, like Ritesh has here.
> > 
> > less gross :) sure. 
> > 
> > I would like to think of this as, being less restrictive (compared to
> > only allowing a single fsblock) by adding a constraint on the atomic
> > write I/O request i.e.  
> > 
> > "Atomic write I/O request to a region in a file is only allowed if that
> > region has no partially allocated extents. Otherwise, the file system
> > can fail the I/O operation by returning -EINVAL."
> > 
> > Essentially by adding this constraint to the I/O request, we are
> > helping the user to prevent atomic writes from accidentally getting
> > torned and also allowing multi-fsblock writes. So I still think that
> > might be the right thing to do here or at least a better start. FS can
> > later work on adding such support where we don't even need above
> > such constraint on a given atomic write I/O request.
> 
> On today's ext4 call, Ted and Ritesh and I realized that there's a bit
> more to it than this -- it's not possible to support untorn writes to a
> mix of written/(cow,unwritten) mappings even if they all point to the
> same physical space.  If the system fails after the storage device
> commits the write but before any of the ioend processing is scheduled, a
> subsequent read of the previously written blocks will produce the new
> data, but reads to the other areas will produce the old contents (or
> zeroes, or whatever).  That's a torn write.

I'm *really* surprised that people are only realising that IO
completion processing for atomic writes *must be atomic*.

This was the foundational constraint that the forced alignment
proposal for XFS was intended to address. i.e. to prevent fs
operations from violating atomic write IO constraints (e.g. punching
sub-atomic write size holes in the file) so that the physical IO can
be done without tearing and the IO completion processing that
exposes the new data can be done atomically.

> Therefore, iomap ought to stick to requiring that ->iomap_begin returns
> a single iomap to cover the entire file range for the untorn write.  For
> an unwritten extent, the post-recovery read will see either zeroes or
> the new contents; for a single-mapping COW it'll see old or new contents
> but not both.

I'm pretty sure we enforced that in the XFS mapping implemention for
atomic writes using forced alignment. i.e.  we have to return a
correctly aligned, contiguous mapping to iomap or we have to return
-EINVAL to indicate atomic write mapping failed.

Yes, we can check this in iomap, but it's really the filesystem that
has to implement and enforce it...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes
  2024-11-04  1:52                             ` Dave Chinner
@ 2024-11-05  0:09                               ` Darrick J. Wong
  0 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2024-11-05  0:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ritesh Harjani, Theodore Ts'o, John Garry, linux-ext4,
	Jan Kara, Christoph Hellwig, Ojaswin Mujoo, linux-kernel,
	linux-xfs, linux-fsdevel

On Mon, Nov 04, 2024 at 12:52:42PM +1100, Dave Chinner wrote:
> On Thu, Oct 31, 2024 at 02:36:40PM -0700, Darrick J. Wong wrote:
> > On Sat, Oct 26, 2024 at 10:05:44AM +0530, Ritesh Harjani wrote:
> > > > This gets me to the third and much less general solution -- only allow
> > > > untorn writes if we know that the ioend only ever has to run a single
> > > > transaction.  That's why untorn writes are limited to a single fsblock
> > > > for now -- it's a simple solution so that we can get our downstream
> > > > customers to kick the tires and start on the next iteration instead of
> > > > spending years on waterfalling.
> > > >
> > > > Did you notice that in all of these cases, the capabilities of the
> > > > filesystem's ioend processing determines the restrictions on the number
> > > > and type of mappings that ->iomap_begin can give to iomap?
> > > >
> > > > Now that we have a second system trying to hook up to the iomap support,
> > > > it's clear to me that the restrictions on mappings are specific to each
> > > > filesystem.  Therefore, the iomap directio code should not impose
> > > > restrictions on the mappings it receives unless they would prevent the
> > > > creation of the single aligned bio.
> > > >
> > > > Instead, xfs_direct_write_iomap_begin and ext4_iomap_begin should return
> > > > EINVAL or something if they look at the file mappings and discover that
> > > > they cannot perform the ioend without risking torn mapping updates.  In
> > > > the long run, ->iomap_begin is where this iomap->len <= iter->len check
> > > > really belongs, but hold that thought.
> > > >
> > > > For the multi fsblock case, the ->iomap_begin functions would have to
> > > > check that only one metadata update would be necessary in the ioend.
> > > > That's where things get murky, since ext4/xfs drop their mapping locks
> > > > between calls to ->iomap_begin.  So you'd have to check all the mappings
> > > > for unsupported mixed state every time.  Yuck.
> > > >
> > > 
> > > Thanks Darrick for taking time summarizing what all has been done
> > > and your thoughts here.
> > > 
> > > > It might be less gross to retain the restriction that iomap accepts only
> > > > one mapping for the entire file range, like Ritesh has here.
> > > 
> > > less gross :) sure. 
> > > 
> > > I would like to think of this as, being less restrictive (compared to
> > > only allowing a single fsblock) by adding a constraint on the atomic
> > > write I/O request i.e.  
> > > 
> > > "Atomic write I/O request to a region in a file is only allowed if that
> > > region has no partially allocated extents. Otherwise, the file system
> > > can fail the I/O operation by returning -EINVAL."
> > > 
> > > Essentially by adding this constraint to the I/O request, we are
> > > helping the user to prevent atomic writes from accidentally getting
> > > torned and also allowing multi-fsblock writes. So I still think that
> > > might be the right thing to do here or at least a better start. FS can
> > > later work on adding such support where we don't even need above
> > > such constraint on a given atomic write I/O request.
> > 
> > On today's ext4 call, Ted and Ritesh and I realized that there's a bit
> > more to it than this -- it's not possible to support untorn writes to a
> > mix of written/(cow,unwritten) mappings even if they all point to the
> > same physical space.  If the system fails after the storage device
> > commits the write but before any of the ioend processing is scheduled, a
> > subsequent read of the previously written blocks will produce the new
> > data, but reads to the other areas will produce the old contents (or
> > zeroes, or whatever).  That's a torn write.
> 
> I'm *really* surprised that people are only realising that IO
> completion processing for atomic writes *must be atomic*.

I've been saying for a while; it's just that I didn't realize until last
week that there were more rules than "can't do an untorn write if you
need to do more than 1 mapping update":

Untorn writes are not possible if:
1. More than 1 mapping update is needed
2. 1 mapping update is needed but there's a written block

(1) can be worked around with a log intent item for ioend processing,
but I don't think (2) can at all.  That's why I went back to saying that
untorn writes require that there can only be one mapping.

> This was the foundational constraint that the forced alignment
> proposal for XFS was intended to address. i.e. to prevent fs
> operations from violating atomic write IO constraints (e.g. punching
> sub-atomic write size holes in the file) so that the physical IO can
> be done without tearing and the IO completion processing that
> exposes the new data can be done atomically.
> 
> > Therefore, iomap ought to stick to requiring that ->iomap_begin returns
> > a single iomap to cover the entire file range for the untorn write.  For
> > an unwritten extent, the post-recovery read will see either zeroes or
> > the new contents; for a single-mapping COW it'll see old or new contents
> > but not both.
> 
> I'm pretty sure we enforced that in the XFS mapping implemention for
> atomic writes using forced alignment. i.e.  we have to return a
> correctly aligned, contiguous mapping to iomap or we have to return
> -EINVAL to indicate atomic write mapping failed.

I think you guys did, it's just the ext4 bigalloc thing that started
this back up again. :/

> Yes, we can check this in iomap, but it's really the filesystem that
> has to implement and enforce it...

I think we ought to keep the "1 mapping per untorn write" check in iomap
until someone decides that it's worth the trouble to make a filesystem
handle mixed states correctly.  Mostly as a guard against the
implementations.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2024-11-05  0:09 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-25  3:45 [PATCH 0/6] ext4: Add atomic write support for DIO Ritesh Harjani (IBM)
2024-10-25  3:45 ` [PATCH 1/6] ext4: Add statx support for atomic writes Ritesh Harjani (IBM)
2024-10-25  9:41   ` John Garry
2024-10-25 10:08     ` Ritesh Harjani
2024-10-25 16:09       ` Darrick J. Wong
2024-10-25 17:45         ` Ritesh Harjani
2024-10-25  3:45 ` [PATCH 2/6] ext4: Check for atomic writes support in write iter Ritesh Harjani (IBM)
2024-10-25  9:44   ` John Garry
2024-10-25 10:33     ` Ritesh Harjani
2024-10-25 16:11       ` Darrick J. Wong
2024-10-25 17:50         ` Ritesh Harjani
2024-10-25  3:45 ` [PATCH 3/6] ext4: Support setting FMODE_CAN_ATOMIC_WRITE Ritesh Harjani (IBM)
2024-10-25  3:45 ` [PATCH 4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes Ritesh Harjani (IBM)
2024-10-25 16:16   ` Darrick J. Wong
2024-10-25 17:51     ` Ritesh Harjani
2024-10-27 22:26   ` Dave Chinner
2024-10-28  1:09     ` Ritesh Harjani
2024-10-28  5:26       ` Dave Chinner
2024-10-28  8:43         ` Ritesh Harjani
2024-10-28 18:14         ` Ritesh Harjani
2024-10-29 22:29           ` Dave Chinner
2024-10-29 23:51             ` Ritesh Harjani
2024-10-25  3:45 ` [PATCH 5/6] iomap: Lift blocksize restriction on " Ritesh Harjani (IBM)
2024-10-25  8:52   ` John Garry
2024-10-25  9:31     ` Ritesh Harjani
2024-10-25  9:59       ` John Garry
2024-10-25 10:35         ` Ritesh Harjani
2024-10-25 11:07           ` John Garry
2024-10-25 11:19             ` Ritesh Harjani
2024-10-25 12:23               ` John Garry
2024-10-25 12:36                 ` Ritesh Harjani
2024-10-25 14:04                   ` John Garry
2024-10-25 14:13                     ` Ritesh Harjani
2024-10-25 18:28                       ` Darrick J. Wong
2024-10-26  4:35                         ` Ritesh Harjani
2024-10-31 21:36                           ` Darrick J. Wong
2024-11-04  1:52                             ` Dave Chinner
2024-11-05  0:09                               ` Darrick J. Wong
2024-10-25  3:45 ` [PATCH 6/6] ext4: Add atomic write support for bigalloc Ritesh Harjani (IBM)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).