linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v6 0/7] block atomic writes for xfs
@ 2024-09-30 12:54 John Garry
  2024-09-30 12:54 ` [PATCH v6 1/7] block/fs: Pass an iocb to generic_atomic_write_valid() John Garry
                   ` (6 more replies)
  0 siblings, 7 replies; 29+ messages in thread
From: John Garry @ 2024-09-30 12:54 UTC (permalink / raw)
  To: axboe, brauner, djwong, viro, jack, dchinner, hch, cem
  Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
	martin.petersen, catherine.hoang, mcgrof, ritesh.list, ojaswin,
	John Garry

This series expands atomic write support to filesystems, specifically
XFS.

Flag FS_XFLAG_ATOMICWRITES is added as an enabling flag for atomic writes.

XFS can be formatted for atomic writes as follows:
mkfs.xfs  -i atomicwrites=1 -b size=16384 /dev/sda

Support can be enabled through xfs_io command:
$xfs_io -c "chattr +W" filename
$xfs_io -c "lsattr -v" filename
[atomic-writes] filename
$xfs_io -c statx filename
...
stat.stx_atomic_write_unit_min = 16384
stat.stx_atomic_write_unit_max = 16384
stat.stx_atomic_write_segments_max = 1
...

Dependence on forcealign is gone for the moment. This means that we can
only write a single FS block atomically. Since we can now have FS block
size > PAGE_SIZE for XFS, we can write atomically write > 4K blocks on
x86.

Using the large FS block size has downsides, so we still want the
forcealign feature.

Baseline is v6.12-rc1

Basic xfsprogs support at:
https://github.com/johnpgarry/xfsprogs-dev/tree/atomic-writes-v6

Patches for this series can be found at:
https://github.com/johnpgarry/linux/tree/atomic-writes-v6.12-fs-v6

Changes since v5:
- Drop forcealign dependency
- Can support atomically writing a single FS block
- XFS_DIFLAG2_ATOMICWRITES is inherited from parent directories
- Add RB tags from Darrick (thanks!)

Changes since v4:
- Drop iomap extent-based zeroing and use single bio to cover multiple
  extents
- Move forcealign to another series
- Various change in ioctl, sb, inode validation
- Add patch to tweak generic_atomic_write_valid() API

John Garry (7):
  block/fs: Pass an iocb to generic_atomic_write_valid()
  fs: Export generic_atomic_write_valid()
  fs: iomap: Atomic write support
  xfs: Support FS_XFLAG_ATOMICWRITES
  xfs: Support atomic write for statx
  xfs: Validate atomic writes
  xfs: Support setting FMODE_CAN_ATOMIC_WRITE

 block/fops.c                   |  8 +++----
 fs/iomap/direct-io.c           | 26 ++++++++++++++++++++---
 fs/iomap/trace.h               |  3 ++-
 fs/read_write.c                |  5 +++--
 fs/xfs/libxfs/xfs_format.h     | 11 ++++++++--
 fs/xfs/libxfs/xfs_inode_buf.c  | 38 ++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_inode_util.c |  6 ++++++
 fs/xfs/libxfs/xfs_sb.c         |  2 ++
 fs/xfs/xfs_buf.c               | 15 +++++++++++++-
 fs/xfs/xfs_buf.h               |  5 ++++-
 fs/xfs/xfs_buf_mem.c           |  2 +-
 fs/xfs/xfs_file.c              | 19 +++++++++++++++++
 fs/xfs/xfs_inode.h             | 22 ++++++++++++++++++++
 fs/xfs/xfs_ioctl.c             | 37 +++++++++++++++++++++++++++++++++
 fs/xfs/xfs_iops.c              | 24 +++++++++++++++++++++
 fs/xfs/xfs_mount.h             |  2 ++
 fs/xfs/xfs_reflink.c           |  4 ++++
 fs/xfs/xfs_super.c             |  4 ++++
 include/linux/fs.h             |  2 +-
 include/linux/iomap.h          |  1 +
 include/uapi/linux/fs.h        |  1 +
 21 files changed, 221 insertions(+), 16 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v6 1/7] block/fs: Pass an iocb to generic_atomic_write_valid()
  2024-09-30 12:54 [PATCH v6 0/7] block atomic writes for xfs John Garry
@ 2024-09-30 12:54 ` John Garry
  2024-10-01  8:39   ` Christoph Hellwig
  2024-09-30 12:54 ` [PATCH v6 2/7] fs: Export generic_atomic_write_valid() John Garry
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: John Garry @ 2024-09-30 12:54 UTC (permalink / raw)
  To: axboe, brauner, djwong, viro, jack, dchinner, hch, cem
  Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
	martin.petersen, catherine.hoang, mcgrof, ritesh.list, ojaswin,
	John Garry

Darrick and Hannes both thought it better that generic_atomic_write_valid()
should be passed a struct iocb, and not just the member of that struct
which is referenced; see [0] and [1].

I think that makes a more generic and clean API, so make that change.

[0] https://lore.kernel.org/linux-block/680ce641-729b-4150-b875-531a98657682@suse.de/
[1] https://lore.kernel.org/linux-xfs/20240620212401.GA3058325@frogsfrogsfrogs/

Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
Suggested-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/fops.c       | 8 ++++----
 fs/read_write.c    | 4 ++--
 include/linux/fs.h | 2 +-
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index e696ae53bf1e..968b47b615c4 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -35,13 +35,13 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
 	return opf;
 }
 
-static bool blkdev_dio_invalid(struct block_device *bdev, loff_t pos,
+static bool blkdev_dio_invalid(struct block_device *bdev, struct kiocb *iocb,
 				struct iov_iter *iter, bool is_atomic)
 {
-	if (is_atomic && !generic_atomic_write_valid(iter, pos))
+	if (is_atomic && !generic_atomic_write_valid(iocb, iter))
 		return true;
 
-	return pos & (bdev_logical_block_size(bdev) - 1) ||
+	return iocb->ki_pos & (bdev_logical_block_size(bdev) - 1) ||
 		!bdev_iter_is_aligned(bdev, iter);
 }
 
@@ -374,7 +374,7 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	if (!iov_iter_count(iter))
 		return 0;
 
-	if (blkdev_dio_invalid(bdev, iocb->ki_pos, iter, is_atomic))
+	if (blkdev_dio_invalid(bdev, iocb, iter, is_atomic))
 		return -EINVAL;
 
 	nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS + 1);
diff --git a/fs/read_write.c b/fs/read_write.c
index 64dc24afdb3a..2c3263530828 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1830,7 +1830,7 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
 	return 0;
 }
 
-bool generic_atomic_write_valid(struct iov_iter *iter, loff_t pos)
+bool generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter)
 {
 	size_t len = iov_iter_count(iter);
 
@@ -1840,7 +1840,7 @@ bool generic_atomic_write_valid(struct iov_iter *iter, loff_t pos)
 	if (!is_power_of_2(len))
 		return false;
 
-	if (!IS_ALIGNED(pos, len))
+	if (!IS_ALIGNED(iocb->ki_pos, len))
 		return false;
 
 	return true;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3c603d01337..fbfa032d1d90 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3721,6 +3721,6 @@ static inline bool vfs_empty_path(int dfd, const char __user *path)
 	return !c;
 }
 
-bool generic_atomic_write_valid(struct iov_iter *iter, loff_t pos);
+bool generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter);
 
 #endif /* _LINUX_FS_H */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v6 2/7] fs: Export generic_atomic_write_valid()
  2024-09-30 12:54 [PATCH v6 0/7] block atomic writes for xfs John Garry
  2024-09-30 12:54 ` [PATCH v6 1/7] block/fs: Pass an iocb to generic_atomic_write_valid() John Garry
@ 2024-09-30 12:54 ` John Garry
  2024-10-01  8:41   ` Christoph Hellwig
  2024-09-30 12:54 ` [PATCH v6 3/7] fs: iomap: Atomic write support John Garry
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: John Garry @ 2024-09-30 12:54 UTC (permalink / raw)
  To: axboe, brauner, djwong, viro, jack, dchinner, hch, cem
  Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
	martin.petersen, catherine.hoang, mcgrof, ritesh.list, ojaswin,
	John Garry

The XFS code will need this.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/read_write.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/read_write.c b/fs/read_write.c
index 2c3263530828..32b476bf9be0 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1845,3 +1845,4 @@ bool generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter)
 
 	return true;
 }
+EXPORT_SYMBOL_GPL(generic_atomic_write_valid);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v6 3/7] fs: iomap: Atomic write support
  2024-09-30 12:54 [PATCH v6 0/7] block atomic writes for xfs John Garry
  2024-09-30 12:54 ` [PATCH v6 1/7] block/fs: Pass an iocb to generic_atomic_write_valid() John Garry
  2024-09-30 12:54 ` [PATCH v6 2/7] fs: Export generic_atomic_write_valid() John Garry
@ 2024-09-30 12:54 ` John Garry
  2024-09-30 15:55   ` Darrick J. Wong
  2024-09-30 12:54 ` [PATCH v6 4/7] xfs: Support FS_XFLAG_ATOMICWRITES John Garry
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: John Garry @ 2024-09-30 12:54 UTC (permalink / raw)
  To: axboe, brauner, djwong, viro, jack, dchinner, hch, cem
  Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
	martin.petersen, catherine.hoang, mcgrof, ritesh.list, ojaswin,
	John Garry

Support direct I/O atomic writes by producing a single bio with REQ_ATOMIC
flag set.

Initially FSes (XFS) should only support writing a single FS block
atomically.

As with any atomic write, we should produce a single bio which covers the
complete write length.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
Maybe we should also enforce that a mapping is in written state, as it
avoids issues later with forcealign and writing mappings which cover
multiple extents in different written/unwritten state.

 fs/iomap/direct-io.c  | 26 +++++++++++++++++++++++---
 fs/iomap/trace.h      |  3 ++-
 include/linux/iomap.h |  1 +
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f637aa0706a3..9401c05cd2c0 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -271,7 +271,7 @@ static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
  * clearing the WRITE_THROUGH flag in the dio request.
  */
 static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
-		const struct iomap *iomap, bool use_fua)
+		const struct iomap *iomap, bool use_fua, bool atomic)
 {
 	blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
 
@@ -283,6 +283,8 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
 		opflags |= REQ_FUA;
 	else
 		dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
+	if (atomic)
+		opflags |= REQ_ATOMIC;
 
 	return opflags;
 }
@@ -293,7 +295,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	const struct iomap *iomap = &iter->iomap;
 	struct inode *inode = iter->inode;
 	unsigned int fs_block_size = i_blocksize(inode), pad;
-	loff_t length = iomap_length(iter);
+	const loff_t length = iomap_length(iter);
+	bool atomic = iter->flags & IOMAP_ATOMIC;
 	loff_t pos = iter->pos;
 	blk_opf_t bio_opf;
 	struct bio *bio;
@@ -303,6 +306,9 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	size_t copied = 0;
 	size_t orig_count;
 
+	if (atomic && (length != fs_block_size))
+		return -EINVAL;
+
 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
 	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
 		return -EINVAL;
@@ -382,7 +388,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	 * can set up the page vector appropriately for a ZONE_APPEND
 	 * operation.
 	 */
-	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
+	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic);
 
 	nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
 	do {
@@ -415,6 +421,17 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		}
 
 		n = bio->bi_iter.bi_size;
+		if (atomic && n != length) {
+			/*
+			 * This bio should have covered the complete length,
+			 * which it doesn't, so error. We may need to zero out
+			 * the tail (complete FS block), similar to when
+			 * bio_iov_iter_get_pages() returns an error, above.
+			 */
+			ret = -EINVAL;
+			bio_put(bio);
+			goto zero_tail;
+		}
 		if (dio->flags & IOMAP_DIO_WRITE) {
 			task_io_account_write(n);
 		} else {
@@ -598,6 +615,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		iomi.flags |= IOMAP_NOWAIT;
 
+	if (iocb->ki_flags & IOCB_ATOMIC)
+		iomi.flags |= IOMAP_ATOMIC;
+
 	if (iov_iter_rw(iter) == READ) {
 		/* reads can always complete inline */
 		dio->flags |= IOMAP_DIO_INLINE_COMP;
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index 0a991c4ce87d..4118a42cdab0 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -98,7 +98,8 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
 	{ IOMAP_REPORT,		"REPORT" }, \
 	{ IOMAP_FAULT,		"FAULT" }, \
 	{ IOMAP_DIRECT,		"DIRECT" }, \
-	{ IOMAP_NOWAIT,		"NOWAIT" }
+	{ IOMAP_NOWAIT,		"NOWAIT" }, \
+	{ IOMAP_ATOMIC,		"ATOMIC" }
 
 #define IOMAP_F_FLAGS_STRINGS \
 	{ IOMAP_F_NEW,		"NEW" }, \
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4ad12a3c8bae..c7644bdcfca3 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -178,6 +178,7 @@ struct iomap_folio_ops {
 #else
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
+#define IOMAP_ATOMIC		(1 << 9)
 
 struct iomap_ops {
 	/*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v6 4/7] xfs: Support FS_XFLAG_ATOMICWRITES
  2024-09-30 12:54 [PATCH v6 0/7] block atomic writes for xfs John Garry
                   ` (2 preceding siblings ...)
  2024-09-30 12:54 ` [PATCH v6 3/7] fs: iomap: Atomic write support John Garry
@ 2024-09-30 12:54 ` John Garry
  2024-09-30 16:03   ` Darrick J. Wong
  2024-10-03 12:48   ` John Garry
  2024-09-30 12:54 ` [PATCH v6 5/7] xfs: Support atomic write for statx John Garry
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 29+ messages in thread
From: John Garry @ 2024-09-30 12:54 UTC (permalink / raw)
  To: axboe, brauner, djwong, viro, jack, dchinner, hch, cem
  Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
	martin.petersen, catherine.hoang, mcgrof, ritesh.list, ojaswin,
	John Garry

Add initial support for new flag FS_XFLAG_ATOMICWRITES.

This flag is a file attribute that mirrors an ondisk inode flag.  Actual
support for untorn file writes (for now) depends on both the iflag and the
underlying storage devices, which we can only really check at statx and
pwritev2() time.  This is the same story as FS_XFLAG_DAX, which signals to
the fs that we should try to enable the fsdax IO path on the file (instead
of the regular page cache), but applications have to query STAT_ATTR_DAX to
find out if they really got that IO path.

Current kernel support for atomic writes is based on HW support (for atomic
writes). Since for regular files XFS has no way to specify extent alignment
or granularity, atomic write size is limited to the FS block size.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h     | 11 ++++++++--
 fs/xfs/libxfs/xfs_inode_buf.c  | 38 ++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_inode_util.c |  6 ++++++
 fs/xfs/libxfs/xfs_sb.c         |  2 ++
 fs/xfs/xfs_buf.c               | 15 +++++++++++++-
 fs/xfs/xfs_buf.h               |  5 ++++-
 fs/xfs/xfs_buf_mem.c           |  2 +-
 fs/xfs/xfs_inode.h             |  5 +++++
 fs/xfs/xfs_ioctl.c             | 37 +++++++++++++++++++++++++++++++++
 fs/xfs/xfs_mount.h             |  2 ++
 fs/xfs/xfs_reflink.c           |  4 ++++
 fs/xfs/xfs_super.c             |  4 ++++
 include/uapi/linux/fs.h        |  1 +
 13 files changed, 127 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index e1bfee0c3b1a..ed5e5442f0d4 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -352,11 +352,15 @@ xfs_sb_has_compat_feature(
 #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
 #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
 #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
+#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 31)	/* atomicwrites enabled */
+
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
 		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
 		 XFS_SB_FEAT_RO_COMPAT_REFLINK| \
-		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
+		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
+		 XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
+
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
@@ -1093,16 +1097,19 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
 #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
 #define XFS_DIFLAG2_BIGTIME_BIT	3	/* big timestamps */
 #define XFS_DIFLAG2_NREXT64_BIT 4	/* large extent counters */
+#define XFS_DIFLAG2_ATOMICWRITES_BIT 5	/* atomic writes permitted */
 
 #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
 #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
 #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
 #define XFS_DIFLAG2_BIGTIME	(1 << XFS_DIFLAG2_BIGTIME_BIT)
 #define XFS_DIFLAG2_NREXT64	(1 << XFS_DIFLAG2_NREXT64_BIT)
+#define XFS_DIFLAG2_ATOMICWRITES	(1 << XFS_DIFLAG2_ATOMICWRITES_BIT)
 
 #define XFS_DIFLAG2_ANY \
 	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
-	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64)
+	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | \
+	 XFS_DIFLAG2_ATOMICWRITES)
 
 static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
 {
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 79babeac9d75..1e852cdd1d6f 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -483,6 +483,36 @@ xfs_dinode_verify_nrext64(
 	return NULL;
 }
 
+static xfs_failaddr_t
+xfs_inode_validate_atomicwrites(
+	struct xfs_mount	*mp,
+	uint32_t		cowextsize,
+	uint16_t		mode,
+	int64_t			flags2)
+{
+	/* superblock rocompat feature flag */
+	if (!xfs_has_atomicwrites(mp))
+		return __this_address;
+
+	/* Only regular files and directories */
+	if (!S_ISREG(mode) && !(S_ISDIR(mode)))
+		return __this_address;
+
+	/* COW extsize disallowed */
+	if (flags2 & XFS_DIFLAG2_COWEXTSIZE)
+		return __this_address;
+
+	/* cowextsize must be zero */
+	if (cowextsize)
+		return __this_address;
+
+	/* reflink is disallowed */
+	if (flags2 & XFS_DIFLAG2_REFLINK)
+		return __this_address;
+
+	return NULL;
+}
+
 xfs_failaddr_t
 xfs_dinode_verify(
 	struct xfs_mount	*mp,
@@ -663,6 +693,14 @@ xfs_dinode_verify(
 	    !xfs_has_bigtime(mp))
 		return __this_address;
 
+	if (flags2 & XFS_DIFLAG2_ATOMICWRITES) {
+		fa = xfs_inode_validate_atomicwrites(mp,
+				be32_to_cpu(dip->di_cowextsize),
+				mode, flags2);
+		if (fa)
+			return fa;
+	}
+
 	return NULL;
 }
 
diff --git a/fs/xfs/libxfs/xfs_inode_util.c b/fs/xfs/libxfs/xfs_inode_util.c
index cc38e1c3c3e1..e59e98783bf7 100644
--- a/fs/xfs/libxfs/xfs_inode_util.c
+++ b/fs/xfs/libxfs/xfs_inode_util.c
@@ -80,6 +80,8 @@ xfs_flags2diflags2(
 		di_flags2 |= XFS_DIFLAG2_DAX;
 	if (xflags & FS_XFLAG_COWEXTSIZE)
 		di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
+	if (xflags & FS_XFLAG_ATOMICWRITES)
+		di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;
 
 	return di_flags2;
 }
@@ -126,6 +128,8 @@ xfs_ip2xflags(
 			flags |= FS_XFLAG_DAX;
 		if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
 			flags |= FS_XFLAG_COWEXTSIZE;
+		if (ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
+			flags |= FS_XFLAG_ATOMICWRITES;
 	}
 
 	if (xfs_inode_has_attr_fork(ip))
@@ -224,6 +228,8 @@ xfs_inode_inherit_flags2(
 	}
 	if (pip->i_diflags2 & XFS_DIFLAG2_DAX)
 		ip->i_diflags2 |= XFS_DIFLAG2_DAX;
+	if (pip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
+		ip->i_diflags2 |= XFS_DIFLAG2_ATOMICWRITES;
 
 	/* Don't let invalid cowextsize hints propagate. */
 	failaddr = xfs_inode_validate_cowextsize(ip->i_mount, ip->i_cowextsize,
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index d95409f3cba6..dd819561d0a5 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -164,6 +164,8 @@ xfs_sb_version_to_features(
 		features |= XFS_FEAT_REFLINK;
 	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
 		features |= XFS_FEAT_INOBTCNT;
+	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
+		features |= XFS_FEAT_ATOMICWRITES;
 	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
 		features |= XFS_FEAT_FTYPE;
 	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index aa4dbda7b536..44bee3e2b2bb 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2060,6 +2060,8 @@ int
 xfs_init_buftarg(
 	struct xfs_buftarg		*btp,
 	size_t				logical_sectorsize,
+	unsigned int			awu_min,
+	unsigned int			awu_max,
 	const char			*descr)
 {
 	/* Set up device logical sector size mask */
@@ -2086,6 +2088,9 @@ xfs_init_buftarg(
 	btp->bt_shrinker->scan_objects = xfs_buftarg_shrink_scan;
 	btp->bt_shrinker->private_data = btp;
 	shrinker_register(btp->bt_shrinker);
+
+	btp->bt_bdev_awu_min = awu_min;
+	btp->bt_bdev_awu_max = awu_max;
 	return 0;
 
 out_destroy_io_count:
@@ -2102,6 +2107,7 @@ xfs_alloc_buftarg(
 {
 	struct xfs_buftarg	*btp;
 	const struct dax_holder_operations *ops = NULL;
+	unsigned int awu_min = 0, awu_max = 0;
 
 #if defined(CONFIG_FS_DAX) && defined(CONFIG_MEMORY_FAILURE)
 	ops = &xfs_dax_holder_operations;
@@ -2115,6 +2121,13 @@ xfs_alloc_buftarg(
 	btp->bt_daxdev = fs_dax_get_by_bdev(btp->bt_bdev, &btp->bt_dax_part_off,
 					    mp, ops);
 
+	if (bdev_can_atomic_write(btp->bt_bdev)) {
+		struct request_queue *q = bdev_get_queue(btp->bt_bdev);
+
+		awu_min = queue_atomic_write_unit_min_bytes(q);
+		awu_max = queue_atomic_write_unit_max_bytes(q);
+	}
+
 	/*
 	 * When allocating the buftargs we have not yet read the super block and
 	 * thus don't know the file system sector size yet.
@@ -2122,7 +2135,7 @@ xfs_alloc_buftarg(
 	if (xfs_setsize_buftarg(btp, bdev_logical_block_size(btp->bt_bdev)))
 		goto error_free;
 	if (xfs_init_buftarg(btp, bdev_logical_block_size(btp->bt_bdev),
-			mp->m_super->s_id))
+			awu_min, awu_max, mp->m_super->s_id))
 		goto error_free;
 
 	return btp;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 209a389f2abc..b813cb60a8f3 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -124,6 +124,9 @@ struct xfs_buftarg {
 	struct percpu_counter	bt_io_count;
 	struct ratelimit_state	bt_ioerror_rl;
 
+	/* Atomic write unit values */
+	unsigned int		bt_bdev_awu_min, bt_bdev_awu_max;
+
 	/* built-in cache, if we're not using the perag one */
 	struct xfs_buf_cache	bt_cache[];
 };
@@ -393,7 +396,7 @@ bool xfs_verify_magic16(struct xfs_buf *bp, __be16 dmagic);
 
 /* for xfs_buf_mem.c only: */
 int xfs_init_buftarg(struct xfs_buftarg *btp, size_t logical_sectorsize,
-		const char *descr);
+		unsigned int awu_min, unsigned int awu_max, const char *descr);
 void xfs_destroy_buftarg(struct xfs_buftarg *btp);
 
 #endif	/* __XFS_BUF_H__ */
diff --git a/fs/xfs/xfs_buf_mem.c b/fs/xfs/xfs_buf_mem.c
index 07bebbfb16ee..722d75f89767 100644
--- a/fs/xfs/xfs_buf_mem.c
+++ b/fs/xfs/xfs_buf_mem.c
@@ -93,7 +93,7 @@ xmbuf_alloc(
 	btp->bt_meta_sectorsize = XMBUF_BLOCKSIZE;
 	btp->bt_meta_sectormask = XMBUF_BLOCKSIZE - 1;
 
-	error = xfs_init_buftarg(btp, XMBUF_BLOCKSIZE, descr);
+	error = xfs_init_buftarg(btp, XMBUF_BLOCKSIZE, 0, 0, descr);
 	if (error)
 		goto out_bcache;
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 97ed912306fd..1c62ee294a5a 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -327,6 +327,11 @@ static inline bool xfs_inode_has_bigrtalloc(struct xfs_inode *ip)
 	(XFS_IS_REALTIME_INODE(ip) ? \
 		(ip)->i_mount->m_rtdev_targp : (ip)->i_mount->m_ddev_targp)
 
+static inline bool xfs_inode_has_atomicwrites(struct xfs_inode *ip)
+{
+	return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
+}
+
 /*
  * In-core inode flags.
  */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index a20d426ef021..81872c32dcb2 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -469,6 +469,36 @@ xfs_fileattr_get(
 	return 0;
 }
 
+static int
+xfs_ioctl_setattr_atomicwrites(
+	struct xfs_inode	*ip,
+	struct fileattr		*fa)
+{
+	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_sb		*sbp = &mp->m_sb;
+
+	if (!xfs_has_atomicwrites(mp))
+		return -EINVAL;
+
+	if (target->bt_bdev_awu_min > sbp->sb_blocksize)
+		return -EINVAL;
+
+	if (target->bt_bdev_awu_max < sbp->sb_blocksize)
+		return -EINVAL;
+
+	if (xfs_is_reflink_inode(ip))
+		return -EINVAL;
+
+	if (fa->fsx_xflags & FS_XFLAG_COWEXTSIZE)
+		return -EINVAL;
+
+	if (fa->fsx_cowextsize)
+		return -EINVAL;
+
+	return 0;
+}
+
 static int
 xfs_ioctl_setattr_xflags(
 	struct xfs_trans	*tp,
@@ -478,6 +508,7 @@ xfs_ioctl_setattr_xflags(
 	struct xfs_mount	*mp = ip->i_mount;
 	bool			rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
 	uint64_t		i_flags2;
+	int			error;
 
 	if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
 		/* Can't change realtime flag if any extents are allocated. */
@@ -512,6 +543,12 @@ xfs_ioctl_setattr_xflags(
 	if (i_flags2 && !xfs_has_v3inodes(mp))
 		return -EINVAL;
 
+	if (fa->fsx_xflags & FS_XFLAG_ATOMICWRITES) {
+		error = xfs_ioctl_setattr_atomicwrites(ip, fa);
+		if (error)
+			return error;
+	}
+
 	ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
 	ip->i_diflags2 = i_flags2;
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 96496f39f551..6ac6518a2ef3 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -298,6 +298,7 @@ typedef struct xfs_mount {
 #define XFS_FEAT_NEEDSREPAIR	(1ULL << 25)	/* needs xfs_repair */
 #define XFS_FEAT_NREXT64	(1ULL << 26)	/* large extent counters */
 #define XFS_FEAT_EXCHANGE_RANGE	(1ULL << 27)	/* exchange range */
+#define XFS_FEAT_ATOMICWRITES	(1ULL << 28)	/* atomic writes support */
 
 /* Mount features */
 #define XFS_FEAT_NOATTR2	(1ULL << 48)	/* disable attr2 creation */
@@ -384,6 +385,7 @@ __XFS_ADD_V4_FEAT(projid32, PROJID32)
 __XFS_HAS_V4_FEAT(v3inodes, V3INODES)
 __XFS_HAS_V4_FEAT(crc, CRC)
 __XFS_HAS_V4_FEAT(pquotino, PQUOTINO)
+__XFS_HAS_FEAT(atomicwrites, ATOMICWRITES)
 
 /*
  * Mount features
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 6fde6ec8092f..6679b12a56c9 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1471,6 +1471,10 @@ xfs_reflink_remap_prep(
 	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
 		goto out_unlock;
 
+	/* Don't reflink atomic write inodes */
+	if (xfs_inode_has_atomicwrites(src) || xfs_inode_has_atomicwrites(dest))
+		goto out_unlock;
+
 	/* Don't share DAX file data with non-DAX file. */
 	if (IS_DAX(inode_in) != IS_DAX(inode_out))
 		goto out_unlock;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index fbb3a1594c0d..97c1d9493cdb 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1733,6 +1733,10 @@ xfs_fs_fill_super(
 		mp->m_features &= ~XFS_FEAT_DISCARD;
 	}
 
+	if (xfs_has_atomicwrites(mp))
+		xfs_warn(mp,
+	"EXPERIMENTAL atomicwrites feature in use. Use at your own risk!");
+
 	if (xfs_has_reflink(mp)) {
 		if (mp->m_sb.sb_rblocks) {
 			xfs_alert(mp,
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 753971770733..e813217e0fe4 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -158,6 +158,7 @@ struct fsxattr {
 #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
 #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
 #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
+#define FS_XFLAG_ATOMICWRITES	0x00020000	/* atomic writes enabled */
 #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
 
 /* the read-only stuff doesn't really belong here, but any other place is
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v6 5/7] xfs: Support atomic write for statx
  2024-09-30 12:54 [PATCH v6 0/7] block atomic writes for xfs John Garry
                   ` (3 preceding siblings ...)
  2024-09-30 12:54 ` [PATCH v6 4/7] xfs: Support FS_XFLAG_ATOMICWRITES John Garry
@ 2024-09-30 12:54 ` John Garry
  2024-09-30 16:37   ` Darrick J. Wong
  2024-09-30 12:54 ` [PATCH v6 6/7] xfs: Validate atomic writes John Garry
  2024-09-30 12:54 ` [PATCH v6 7/7] xfs: Support setting FMODE_CAN_ATOMIC_WRITE John Garry
  6 siblings, 1 reply; 29+ messages in thread
From: John Garry @ 2024-09-30 12:54 UTC (permalink / raw)
  To: axboe, brauner, djwong, viro, jack, dchinner, hch, cem
  Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
	martin.petersen, catherine.hoang, mcgrof, ritesh.list, ojaswin,
	John Garry

Support providing info on atomic write unit min and max for an inode.

For simplicity, currently we limit the min at the FS block size. As for
max, we limit also at FS block size, as there is no current method to
guarantee extent alignment or granularity for regular files.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_inode.h | 17 +++++++++++++++++
 fs/xfs/xfs_iops.c  | 24 ++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 1c62ee294a5a..1ea73402d592 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -332,6 +332,23 @@ static inline bool xfs_inode_has_atomicwrites(struct xfs_inode *ip)
 	return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
 }
 
+static inline bool
+xfs_inode_can_atomicwrite(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
+
+	if (!xfs_inode_has_atomicwrites(ip))
+		return false;
+	if (mp->m_sb.sb_blocksize < target->bt_bdev_awu_min)
+		return false;
+	if (mp->m_sb.sb_blocksize > target->bt_bdev_awu_max)
+		return false;
+
+	return true;
+}
+
 /*
  * In-core inode flags.
  */
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ee79cf161312..915d057db9bb 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -570,6 +570,23 @@ xfs_stat_blksize(
 	return max_t(uint32_t, PAGE_SIZE, mp->m_sb.sb_blocksize);
 }
 
+static void
+xfs_get_atomic_write_attr(
+	struct xfs_inode	*ip,
+	unsigned int		*unit_min,
+	unsigned int		*unit_max)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_sb		*sbp = &mp->m_sb;
+
+	if (!xfs_inode_can_atomicwrite(ip)) {
+		*unit_min = *unit_max = 0;
+		return;
+	}
+
+	*unit_min = *unit_max = sbp->sb_blocksize;
+}
+
 STATIC int
 xfs_vn_getattr(
 	struct mnt_idmap	*idmap,
@@ -643,6 +660,13 @@ xfs_vn_getattr(
 			stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
 			stat->dio_offset_align = bdev_logical_block_size(bdev);
 		}
+		if (request_mask & STATX_WRITE_ATOMIC) {
+			unsigned int unit_min, unit_max;
+
+			xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
+			generic_fill_statx_atomic_writes(stat,
+				unit_min, unit_max);
+		}
 		fallthrough;
 	default:
 		stat->blksize = xfs_stat_blksize(ip);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v6 6/7] xfs: Validate atomic writes
  2024-09-30 12:54 [PATCH v6 0/7] block atomic writes for xfs John Garry
                   ` (4 preceding siblings ...)
  2024-09-30 12:54 ` [PATCH v6 5/7] xfs: Support atomic write for statx John Garry
@ 2024-09-30 12:54 ` John Garry
  2024-09-30 16:41   ` Darrick J. Wong
  2024-09-30 12:54 ` [PATCH v6 7/7] xfs: Support setting FMODE_CAN_ATOMIC_WRITE John Garry
  6 siblings, 1 reply; 29+ messages in thread
From: John Garry @ 2024-09-30 12:54 UTC (permalink / raw)
  To: axboe, brauner, djwong, viro, jack, dchinner, hch, cem
  Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
	martin.petersen, catherine.hoang, mcgrof, ritesh.list, ojaswin,
	John Garry

Validate that an atomic write adheres to length/offset rules. Currently
we can only write a single FS block.

For an IOCB with IOCB_ATOMIC set to get as far as xfs_file_dio_write(),
FMODE_CAN_ATOMIC_WRITE will need to be set for the file; for this,
ATOMICWRITES flags would also need to be set for the inode.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 412b1d71b52b..fa6a44b88ecc 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -688,6 +688,13 @@ xfs_file_dio_write(
 	struct xfs_buftarg      *target = xfs_inode_buftarg(ip);
 	size_t			count = iov_iter_count(from);
 
+	if (iocb->ki_flags & IOCB_ATOMIC) {
+		if (count != ip->i_mount->m_sb.sb_blocksize)
+			return -EINVAL;
+		if (!generic_atomic_write_valid(iocb, from))
+			return -EINVAL;
+	}
+
 	/* direct I/O must be aligned to device logical sector size */
 	if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
 		return -EINVAL;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v6 7/7] xfs: Support setting FMODE_CAN_ATOMIC_WRITE
  2024-09-30 12:54 [PATCH v6 0/7] block atomic writes for xfs John Garry
                   ` (5 preceding siblings ...)
  2024-09-30 12:54 ` [PATCH v6 6/7] xfs: Validate atomic writes John Garry
@ 2024-09-30 12:54 ` John Garry
  2024-09-30 16:41   ` Darrick J. Wong
  6 siblings, 1 reply; 29+ messages in thread
From: John Garry @ 2024-09-30 12:54 UTC (permalink / raw)
  To: axboe, brauner, djwong, viro, jack, dchinner, hch, cem
  Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
	martin.petersen, catherine.hoang, mcgrof, ritesh.list, ojaswin,
	John Garry

For when an inode is enabled for atomic writes, set FMODE_CAN_ATOMIC_WRITE
flag. Only direct IO is currently supported, so check for that also.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index fa6a44b88ecc..a358657a1ae6 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1208,6 +1208,16 @@ xfs_file_remap_range(
 	return remapped > 0 ? remapped : ret;
 }
 
+static bool
+xfs_file_open_can_atomicwrite(
+	struct inode		*inode,
+	struct file		*file)
+{
+	if (!(file->f_flags & O_DIRECT))
+		return false;
+	return xfs_inode_can_atomicwrite(XFS_I(inode));
+}
+
 STATIC int
 xfs_file_open(
 	struct inode	*inode,
@@ -1216,6 +1226,8 @@ xfs_file_open(
 	if (xfs_is_shutdown(XFS_M(inode->i_sb)))
 		return -EIO;
 	file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
+	if (xfs_file_open_can_atomicwrite(inode, file))
+		file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
 	return generic_file_open(inode, file);
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 3/7] fs: iomap: Atomic write support
  2024-09-30 12:54 ` [PATCH v6 3/7] fs: iomap: Atomic write support John Garry
@ 2024-09-30 15:55   ` Darrick J. Wong
  2024-10-01  8:05     ` John Garry
  0 siblings, 1 reply; 29+ messages in thread
From: Darrick J. Wong @ 2024-09-30 15:55 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, brauner, viro, jack, dchinner, hch, cem, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, hare, martin.petersen,
	catherine.hoang, mcgrof, ritesh.list, ojaswin

On Mon, Sep 30, 2024 at 12:54:34PM +0000, John Garry wrote:
> Support direct I/O atomic writes by producing a single bio with REQ_ATOMIC
> flag set.
> 
> Initially FSes (XFS) should only support writing a single FS block
> atomically.
> 
> As with any atomic write, we should produce a single bio which covers the
> complete write length.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> Maybe we should also enforce that a mapping is in written state, as it
> avoids issues later with forcealign and writing mappings which cover
> multiple extents in different written/unwritten state.
> 
>  fs/iomap/direct-io.c  | 26 +++++++++++++++++++++++---
>  fs/iomap/trace.h      |  3 ++-
>  include/linux/iomap.h |  1 +
>  3 files changed, 26 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f637aa0706a3..9401c05cd2c0 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -271,7 +271,7 @@ static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>   * clearing the WRITE_THROUGH flag in the dio request.
>   */
>  static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
> -		const struct iomap *iomap, bool use_fua)
> +		const struct iomap *iomap, bool use_fua, bool atomic)
>  {
>  	blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
>  
> @@ -283,6 +283,8 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
>  		opflags |= REQ_FUA;
>  	else
>  		dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
> +	if (atomic)
> +		opflags |= REQ_ATOMIC;
>  
>  	return opflags;
>  }
> @@ -293,7 +295,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  	const struct iomap *iomap = &iter->iomap;
>  	struct inode *inode = iter->inode;
>  	unsigned int fs_block_size = i_blocksize(inode), pad;
> -	loff_t length = iomap_length(iter);
> +	const loff_t length = iomap_length(iter);
> +	bool atomic = iter->flags & IOMAP_ATOMIC;
>  	loff_t pos = iter->pos;
>  	blk_opf_t bio_opf;
>  	struct bio *bio;
> @@ -303,6 +306,9 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  	size_t copied = 0;
>  	size_t orig_count;
>  
> +	if (atomic && (length != fs_block_size))
> +		return -EINVAL;
> +
>  	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>  	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>  		return -EINVAL;
> @@ -382,7 +388,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  	 * can set up the page vector appropriately for a ZONE_APPEND
>  	 * operation.
>  	 */
> -	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
> +	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic);
>  
>  	nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
>  	do {
> @@ -415,6 +421,17 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		}
>  
>  		n = bio->bi_iter.bi_size;
> +		if (atomic && n != length) {
> +			/*
> +			 * This bio should have covered the complete length,
> +			 * which it doesn't, so error. We may need to zero out
> +			 * the tail (complete FS block), similar to when
> +			 * bio_iov_iter_get_pages() returns an error, above.
> +			 */
> +			ret = -EINVAL;
> +			bio_put(bio);
> +			goto zero_tail;
> +		}
>  		if (dio->flags & IOMAP_DIO_WRITE) {
>  			task_io_account_write(n);
>  		} else {
> @@ -598,6 +615,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	if (iocb->ki_flags & IOCB_NOWAIT)
>  		iomi.flags |= IOMAP_NOWAIT;
>  
> +	if (iocb->ki_flags & IOCB_ATOMIC)
> +		iomi.flags |= IOMAP_ATOMIC;
> +
>  	if (iov_iter_rw(iter) == READ) {
>  		/* reads can always complete inline */
>  		dio->flags |= IOMAP_DIO_INLINE_COMP;
> diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
> index 0a991c4ce87d..4118a42cdab0 100644
> --- a/fs/iomap/trace.h
> +++ b/fs/iomap/trace.h
> @@ -98,7 +98,8 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
>  	{ IOMAP_REPORT,		"REPORT" }, \
>  	{ IOMAP_FAULT,		"FAULT" }, \
>  	{ IOMAP_DIRECT,		"DIRECT" }, \
> -	{ IOMAP_NOWAIT,		"NOWAIT" }
> +	{ IOMAP_NOWAIT,		"NOWAIT" }, \
> +	{ IOMAP_ATOMIC,		"ATOMIC" }
>  
>  #define IOMAP_F_FLAGS_STRINGS \
>  	{ IOMAP_F_NEW,		"NEW" }, \
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 4ad12a3c8bae..c7644bdcfca3 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -178,6 +178,7 @@ struct iomap_folio_ops {
>  #else
>  #define IOMAP_DAX		0
>  #endif /* CONFIG_FS_DAX */
> +#define IOMAP_ATOMIC		(1 << 9)

This new flag needs a documentation update.  What do you think of this?

diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
index 8e6c721d23301..279db993be7fa 100644
--- a/Documentation/filesystems/iomap/operations.rst
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -513,6 +513,16 @@ IOMAP_WRITE`` with any combination of the following enhancements:
    if the mapping is unwritten and the filesystem cannot handle zeroing
    the unaligned regions without exposing stale contents.
 
+ * ``IOMAP_ATOMIC``: This write must be persisted in its entirety or
+   not at all.
+   The write must not be split into multiple I/O requests.
+   The file range to write must be aligned to satisfy the requirements
+   of both the filesystem and the underlying block device's atomic
+   commit capabilities.
+   If filesystem metadata updates are required (e.g. unwritten extent
+   conversion or copy on write), all updates for the entire file range
+   must be committed atomically as well.
+
 Callers commonly hold ``i_rwsem`` in shared or exclusive mode before
 calling this function.
 
--D

>  
>  struct iomap_ops {
>  	/*
> -- 
> 2.31.1
> 
> 

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 4/7] xfs: Support FS_XFLAG_ATOMICWRITES
  2024-09-30 12:54 ` [PATCH v6 4/7] xfs: Support FS_XFLAG_ATOMICWRITES John Garry
@ 2024-09-30 16:03   ` Darrick J. Wong
  2024-09-30 16:44     ` Darrick J. Wong
                       ` (2 more replies)
  2024-10-03 12:48   ` John Garry
  1 sibling, 3 replies; 29+ messages in thread
From: Darrick J. Wong @ 2024-09-30 16:03 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, brauner, viro, jack, dchinner, hch, cem, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, hare, martin.petersen,
	catherine.hoang, mcgrof, ritesh.list, ojaswin

On Mon, Sep 30, 2024 at 12:54:35PM +0000, John Garry wrote:
> Add initial support for new flag FS_XFLAG_ATOMICWRITES.
> 
> This flag is a file attribute that mirrors an ondisk inode flag.  Actual
> support for untorn file writes (for now) depends on both the iflag and the
> underlying storage devices, which we can only really check at statx and
> pwritev2() time.  This is the same story as FS_XFLAG_DAX, which signals to
> the fs that we should try to enable the fsdax IO path on the file (instead
> of the regular page cache), but applications have to query STAT_ATTR_DAX to
> find out if they really got that IO path.
> 
> Current kernel support for atomic writes is based on HW support (for atomic
> writes). Since for regular files XFS has no way to specify extent alignment
> or granularity, atomic write size is limited to the FS block size.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_format.h     | 11 ++++++++--
>  fs/xfs/libxfs/xfs_inode_buf.c  | 38 ++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_inode_util.c |  6 ++++++
>  fs/xfs/libxfs/xfs_sb.c         |  2 ++
>  fs/xfs/xfs_buf.c               | 15 +++++++++++++-
>  fs/xfs/xfs_buf.h               |  5 ++++-
>  fs/xfs/xfs_buf_mem.c           |  2 +-
>  fs/xfs/xfs_inode.h             |  5 +++++
>  fs/xfs/xfs_ioctl.c             | 37 +++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_mount.h             |  2 ++
>  fs/xfs/xfs_reflink.c           |  4 ++++
>  fs/xfs/xfs_super.c             |  4 ++++
>  include/uapi/linux/fs.h        |  1 +
>  13 files changed, 127 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index e1bfee0c3b1a..ed5e5442f0d4 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -352,11 +352,15 @@ xfs_sb_has_compat_feature(
>  #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
>  #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
>  #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
> +#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 31)	/* atomicwrites enabled */
> +
>  #define XFS_SB_FEAT_RO_COMPAT_ALL \
>  		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
>  		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
>  		 XFS_SB_FEAT_RO_COMPAT_REFLINK| \
> -		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
> +		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
> +		 XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
> +
>  #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
>  static inline bool
>  xfs_sb_has_ro_compat_feature(
> @@ -1093,16 +1097,19 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
>  #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
>  #define XFS_DIFLAG2_BIGTIME_BIT	3	/* big timestamps */
>  #define XFS_DIFLAG2_NREXT64_BIT 4	/* large extent counters */
> +#define XFS_DIFLAG2_ATOMICWRITES_BIT 5	/* atomic writes permitted */
>  
>  #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
>  #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
>  #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
>  #define XFS_DIFLAG2_BIGTIME	(1 << XFS_DIFLAG2_BIGTIME_BIT)
>  #define XFS_DIFLAG2_NREXT64	(1 << XFS_DIFLAG2_NREXT64_BIT)
> +#define XFS_DIFLAG2_ATOMICWRITES	(1 << XFS_DIFLAG2_ATOMICWRITES_BIT)
>  
>  #define XFS_DIFLAG2_ANY \
>  	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
> -	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64)
> +	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | \
> +	 XFS_DIFLAG2_ATOMICWRITES)
>  
>  static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
>  {
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 79babeac9d75..1e852cdd1d6f 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -483,6 +483,36 @@ xfs_dinode_verify_nrext64(
>  	return NULL;
>  }
>  
> +static xfs_failaddr_t
> +xfs_inode_validate_atomicwrites(
> +	struct xfs_mount	*mp,
> +	uint32_t		cowextsize,
> +	uint16_t		mode,
> +	int64_t			flags2)
> +{
> +	/* superblock rocompat feature flag */
> +	if (!xfs_has_atomicwrites(mp))
> +		return __this_address;
> +
> +	/* Only regular files and directories */
> +	if (!S_ISREG(mode) && !(S_ISDIR(mode)))
> +		return __this_address;
> +
> +	/* COW extsize disallowed */
> +	if (flags2 & XFS_DIFLAG2_COWEXTSIZE)
> +		return __this_address;
> +
> +	/* cowextsize must be zero */
> +	if (cowextsize)
> +		return __this_address;
> +
> +	/* reflink is disallowed */
> +	if (flags2 & XFS_DIFLAG2_REFLINK)
> +		return __this_address;

If we're only allowing atomic writes that are 1 fsblock or less, then
copy on write will work correctly because CoWs are always done with
fsblock granularity.  The ioend remap is also committed atomically.

IOWs, it's forcealign that isn't compatible with reflink and you can
drop this incompatibility.

> +
> +	return NULL;
> +}
> +
>  xfs_failaddr_t
>  xfs_dinode_verify(
>  	struct xfs_mount	*mp,
> @@ -663,6 +693,14 @@ xfs_dinode_verify(
>  	    !xfs_has_bigtime(mp))
>  		return __this_address;
>  
> +	if (flags2 & XFS_DIFLAG2_ATOMICWRITES) {
> +		fa = xfs_inode_validate_atomicwrites(mp,
> +				be32_to_cpu(dip->di_cowextsize),

Technically speaking, the space used by di_cowextsize isn't defined on
!reflink filesystems.  The contents are supposed to be zero, but nobody
actually checks that, so you might want to special case this:

		fa = xfs_inode_validate_atomicwrites(mp,
				xfs_has_reflink(mp) ?
					be32_to_cpu(dip->di_cowextsize) : 0,
				mode, flags2);

(inasmuch as this code is getting ugly and maybe you want to use a
temporary variable)

> +				mode, flags2);
> +		if (fa)
> +			return fa;
> +	}
> +
>  	return NULL;
>  }
>  
> diff --git a/fs/xfs/libxfs/xfs_inode_util.c b/fs/xfs/libxfs/xfs_inode_util.c
> index cc38e1c3c3e1..e59e98783bf7 100644
> --- a/fs/xfs/libxfs/xfs_inode_util.c
> +++ b/fs/xfs/libxfs/xfs_inode_util.c
> @@ -80,6 +80,8 @@ xfs_flags2diflags2(
>  		di_flags2 |= XFS_DIFLAG2_DAX;
>  	if (xflags & FS_XFLAG_COWEXTSIZE)
>  		di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
> +	if (xflags & FS_XFLAG_ATOMICWRITES)
> +		di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;
>  
>  	return di_flags2;
>  }
> @@ -126,6 +128,8 @@ xfs_ip2xflags(
>  			flags |= FS_XFLAG_DAX;
>  		if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
>  			flags |= FS_XFLAG_COWEXTSIZE;
> +		if (ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
> +			flags |= FS_XFLAG_ATOMICWRITES;
>  	}
>  
>  	if (xfs_inode_has_attr_fork(ip))
> @@ -224,6 +228,8 @@ xfs_inode_inherit_flags2(
>  	}
>  	if (pip->i_diflags2 & XFS_DIFLAG2_DAX)
>  		ip->i_diflags2 |= XFS_DIFLAG2_DAX;
> +	if (pip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
> +		ip->i_diflags2 |= XFS_DIFLAG2_ATOMICWRITES;
>  
>  	/* Don't let invalid cowextsize hints propagate. */
>  	failaddr = xfs_inode_validate_cowextsize(ip->i_mount, ip->i_cowextsize,
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index d95409f3cba6..dd819561d0a5 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -164,6 +164,8 @@ xfs_sb_version_to_features(
>  		features |= XFS_FEAT_REFLINK;
>  	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
>  		features |= XFS_FEAT_INOBTCNT;
> +	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
> +		features |= XFS_FEAT_ATOMICWRITES;
>  	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
>  		features |= XFS_FEAT_FTYPE;
>  	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index aa4dbda7b536..44bee3e2b2bb 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -2060,6 +2060,8 @@ int
>  xfs_init_buftarg(
>  	struct xfs_buftarg		*btp,
>  	size_t				logical_sectorsize,
> +	unsigned int			awu_min,
> +	unsigned int			awu_max,
>  	const char			*descr)
>  {
>  	/* Set up device logical sector size mask */
> @@ -2086,6 +2088,9 @@ xfs_init_buftarg(
>  	btp->bt_shrinker->scan_objects = xfs_buftarg_shrink_scan;
>  	btp->bt_shrinker->private_data = btp;
>  	shrinker_register(btp->bt_shrinker);
> +
> +	btp->bt_bdev_awu_min = awu_min;
> +	btp->bt_bdev_awu_max = awu_max;
>  	return 0;
>  
>  out_destroy_io_count:
> @@ -2102,6 +2107,7 @@ xfs_alloc_buftarg(
>  {
>  	struct xfs_buftarg	*btp;
>  	const struct dax_holder_operations *ops = NULL;
> +	unsigned int awu_min = 0, awu_max = 0;
>  
>  #if defined(CONFIG_FS_DAX) && defined(CONFIG_MEMORY_FAILURE)
>  	ops = &xfs_dax_holder_operations;
> @@ -2115,6 +2121,13 @@ xfs_alloc_buftarg(
>  	btp->bt_daxdev = fs_dax_get_by_bdev(btp->bt_bdev, &btp->bt_dax_part_off,
>  					    mp, ops);
>  
> +	if (bdev_can_atomic_write(btp->bt_bdev)) {
> +		struct request_queue *q = bdev_get_queue(btp->bt_bdev);
> +
> +		awu_min = queue_atomic_write_unit_min_bytes(q);
> +		awu_max = queue_atomic_write_unit_max_bytes(q);
> +	}
> +
>  	/*
>  	 * When allocating the buftargs we have not yet read the super block and
>  	 * thus don't know the file system sector size yet.
> @@ -2122,7 +2135,7 @@ xfs_alloc_buftarg(
>  	if (xfs_setsize_buftarg(btp, bdev_logical_block_size(btp->bt_bdev)))
>  		goto error_free;
>  	if (xfs_init_buftarg(btp, bdev_logical_block_size(btp->bt_bdev),
> -			mp->m_super->s_id))
> +			awu_min, awu_max, mp->m_super->s_id))
>  		goto error_free;

Rather than passing this into the constructor and making the xmbuf code
pass zeroes, why not set the awu values here in xfs_alloc_buftarg just
before returning btp?

	if (bdev_can_atomic_write(btp->bt_bdev)) {
		struct request_queue *q = bdev_get_queue(btp->bt_bdev);

		btp->bt_bdev_awu_min = queue_atomic_write_unit_min_bytes(q);
		btp->bt_bdev_awu_max = queue_atomic_write_unit_max_bytes(q);
	}


--D

>  	return btp;
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 209a389f2abc..b813cb60a8f3 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -124,6 +124,9 @@ struct xfs_buftarg {
>  	struct percpu_counter	bt_io_count;
>  	struct ratelimit_state	bt_ioerror_rl;
>  
> +	/* Atomic write unit values */
> +	unsigned int		bt_bdev_awu_min, bt_bdev_awu_max;
> +
>  	/* built-in cache, if we're not using the perag one */
>  	struct xfs_buf_cache	bt_cache[];
>  };
> @@ -393,7 +396,7 @@ bool xfs_verify_magic16(struct xfs_buf *bp, __be16 dmagic);
>  
>  /* for xfs_buf_mem.c only: */
>  int xfs_init_buftarg(struct xfs_buftarg *btp, size_t logical_sectorsize,
> -		const char *descr);
> +		unsigned int awu_min, unsigned int awu_max, const char *descr);
>  void xfs_destroy_buftarg(struct xfs_buftarg *btp);
>  
>  #endif	/* __XFS_BUF_H__ */
> diff --git a/fs/xfs/xfs_buf_mem.c b/fs/xfs/xfs_buf_mem.c
> index 07bebbfb16ee..722d75f89767 100644
> --- a/fs/xfs/xfs_buf_mem.c
> +++ b/fs/xfs/xfs_buf_mem.c
> @@ -93,7 +93,7 @@ xmbuf_alloc(
>  	btp->bt_meta_sectorsize = XMBUF_BLOCKSIZE;
>  	btp->bt_meta_sectormask = XMBUF_BLOCKSIZE - 1;
>  
> -	error = xfs_init_buftarg(btp, XMBUF_BLOCKSIZE, descr);
> +	error = xfs_init_buftarg(btp, XMBUF_BLOCKSIZE, 0, 0, descr);
>  	if (error)
>  		goto out_bcache;
>  
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 97ed912306fd..1c62ee294a5a 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -327,6 +327,11 @@ static inline bool xfs_inode_has_bigrtalloc(struct xfs_inode *ip)
>  	(XFS_IS_REALTIME_INODE(ip) ? \
>  		(ip)->i_mount->m_rtdev_targp : (ip)->i_mount->m_ddev_targp)
>  
> +static inline bool xfs_inode_has_atomicwrites(struct xfs_inode *ip)
> +{
> +	return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
> +}
> +
>  /*
>   * In-core inode flags.
>   */
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index a20d426ef021..81872c32dcb2 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -469,6 +469,36 @@ xfs_fileattr_get(
>  	return 0;
>  }
>  
> +static int
> +xfs_ioctl_setattr_atomicwrites(
> +	struct xfs_inode	*ip,
> +	struct fileattr		*fa)
> +{
> +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_sb		*sbp = &mp->m_sb;
> +
> +	if (!xfs_has_atomicwrites(mp))
> +		return -EINVAL;
> +
> +	if (target->bt_bdev_awu_min > sbp->sb_blocksize)
> +		return -EINVAL;
> +
> +	if (target->bt_bdev_awu_max < sbp->sb_blocksize)
> +		return -EINVAL;
> +
> +	if (xfs_is_reflink_inode(ip))
> +		return -EINVAL;
> +
> +	if (fa->fsx_xflags & FS_XFLAG_COWEXTSIZE)
> +		return -EINVAL;
> +
> +	if (fa->fsx_cowextsize)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
>  static int
>  xfs_ioctl_setattr_xflags(
>  	struct xfs_trans	*tp,
> @@ -478,6 +508,7 @@ xfs_ioctl_setattr_xflags(
>  	struct xfs_mount	*mp = ip->i_mount;
>  	bool			rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
>  	uint64_t		i_flags2;
> +	int			error;
>  
>  	if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
>  		/* Can't change realtime flag if any extents are allocated. */
> @@ -512,6 +543,12 @@ xfs_ioctl_setattr_xflags(
>  	if (i_flags2 && !xfs_has_v3inodes(mp))
>  		return -EINVAL;
>  
> +	if (fa->fsx_xflags & FS_XFLAG_ATOMICWRITES) {
> +		error = xfs_ioctl_setattr_atomicwrites(ip, fa);
> +		if (error)
> +			return error;
> +	}
> +
>  	ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
>  	ip->i_diflags2 = i_flags2;
>  
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 96496f39f551..6ac6518a2ef3 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -298,6 +298,7 @@ typedef struct xfs_mount {
>  #define XFS_FEAT_NEEDSREPAIR	(1ULL << 25)	/* needs xfs_repair */
>  #define XFS_FEAT_NREXT64	(1ULL << 26)	/* large extent counters */
>  #define XFS_FEAT_EXCHANGE_RANGE	(1ULL << 27)	/* exchange range */
> +#define XFS_FEAT_ATOMICWRITES	(1ULL << 28)	/* atomic writes support */
>  
>  /* Mount features */
>  #define XFS_FEAT_NOATTR2	(1ULL << 48)	/* disable attr2 creation */
> @@ -384,6 +385,7 @@ __XFS_ADD_V4_FEAT(projid32, PROJID32)
>  __XFS_HAS_V4_FEAT(v3inodes, V3INODES)
>  __XFS_HAS_V4_FEAT(crc, CRC)
>  __XFS_HAS_V4_FEAT(pquotino, PQUOTINO)
> +__XFS_HAS_FEAT(atomicwrites, ATOMICWRITES)
>  
>  /*
>   * Mount features
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 6fde6ec8092f..6679b12a56c9 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1471,6 +1471,10 @@ xfs_reflink_remap_prep(
>  	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
>  		goto out_unlock;
>  
> +	/* Don't reflink atomic write inodes */
> +	if (xfs_inode_has_atomicwrites(src) || xfs_inode_has_atomicwrites(dest))
> +		goto out_unlock;
> +
>  	/* Don't share DAX file data with non-DAX file. */
>  	if (IS_DAX(inode_in) != IS_DAX(inode_out))
>  		goto out_unlock;
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index fbb3a1594c0d..97c1d9493cdb 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1733,6 +1733,10 @@ xfs_fs_fill_super(
>  		mp->m_features &= ~XFS_FEAT_DISCARD;
>  	}
>  
> +	if (xfs_has_atomicwrites(mp))
> +		xfs_warn(mp,
> +	"EXPERIMENTAL atomicwrites feature in use. Use at your own risk!");
> +
>  	if (xfs_has_reflink(mp)) {
>  		if (mp->m_sb.sb_rblocks) {
>  			xfs_alert(mp,
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 753971770733..e813217e0fe4 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -158,6 +158,7 @@ struct fsxattr {
>  #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
>  #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
>  #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
> +#define FS_XFLAG_ATOMICWRITES	0x00020000	/* atomic writes enabled */
>  #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
>  
>  /* the read-only stuff doesn't really belong here, but any other place is
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 5/7] xfs: Support atomic write for statx
  2024-09-30 12:54 ` [PATCH v6 5/7] xfs: Support atomic write for statx John Garry
@ 2024-09-30 16:37   ` Darrick J. Wong
  2024-10-01  8:29     ` John Garry
  2024-10-01  8:43     ` Christoph Hellwig
  0 siblings, 2 replies; 29+ messages in thread
From: Darrick J. Wong @ 2024-09-30 16:37 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, brauner, viro, jack, dchinner, hch, cem, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, hare, martin.petersen,
	catherine.hoang, mcgrof, ritesh.list, ojaswin

On Mon, Sep 30, 2024 at 12:54:36PM +0000, John Garry wrote:
> Support providing info on atomic write unit min and max for an inode.
> 
> For simplicity, currently we limit the min at the FS block size. As for
> max, we limit also at FS block size, as there is no current method to
> guarantee extent alignment or granularity for regular files.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_inode.h | 17 +++++++++++++++++
>  fs/xfs/xfs_iops.c  | 24 ++++++++++++++++++++++++
>  2 files changed, 41 insertions(+)
> 
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 1c62ee294a5a..1ea73402d592 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -332,6 +332,23 @@ static inline bool xfs_inode_has_atomicwrites(struct xfs_inode *ip)
>  	return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
>  }
>  
> +static inline bool
> +xfs_inode_can_atomicwrite(
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> +
> +	if (!xfs_inode_has_atomicwrites(ip))
> +		return false;
> +	if (mp->m_sb.sb_blocksize < target->bt_bdev_awu_min)
> +		return false;
> +	if (mp->m_sb.sb_blocksize > target->bt_bdev_awu_max)
> +		return false;
> +
> +	return true;
> +}
> +
>  /*
>   * In-core inode flags.
>   */
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index ee79cf161312..915d057db9bb 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -570,6 +570,23 @@ xfs_stat_blksize(
>  	return max_t(uint32_t, PAGE_SIZE, mp->m_sb.sb_blocksize);
>  }
>  
> +static void
> +xfs_get_atomic_write_attr(
> +	struct xfs_inode	*ip,
> +	unsigned int		*unit_min,
> +	unsigned int		*unit_max)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_sb		*sbp = &mp->m_sb;
> +
> +	if (!xfs_inode_can_atomicwrite(ip)) {
> +		*unit_min = *unit_max = 0;
> +		return;
> +	}
> +
> +	*unit_min = *unit_max = sbp->sb_blocksize;

Ok, so we're only supporting untorn writes if they're exactly the fs
blocksize, and 1 fsblock is between awu_min/max.  That simplifies a lot
of things. :)

Not supporting sub-fsblock atomic writes means that we'll never hit the
directio COW fallback code, which uses the pagecache.

Not supporting multi-fsblock atomic writes means that you don't have to
figure out how to ensure that we always do cow on forcealign
granularity.  Though as I pointed out elsewhere in this thread, that's a
forcealign problem.

Yay! ;)

> +}
> +
>  STATIC int
>  xfs_vn_getattr(
>  	struct mnt_idmap	*idmap,
> @@ -643,6 +660,13 @@ xfs_vn_getattr(
>  			stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
>  			stat->dio_offset_align = bdev_logical_block_size(bdev);
>  		}
> +		if (request_mask & STATX_WRITE_ATOMIC) {
> +			unsigned int unit_min, unit_max;
> +
> +			xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
> +			generic_fill_statx_atomic_writes(stat,
> +				unit_min, unit_max);

Consistent indenting and wrapping, please:

			xfs_get_atomic_write_attr(ip, &unit_min,
					&unit_max);
			generic_fill_statx_atomic_writes(stat,
					unit_min, unit_max);


With that fixed,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> +		}
>  		fallthrough;
>  	default:
>  		stat->blksize = xfs_stat_blksize(ip);
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 6/7] xfs: Validate atomic writes
  2024-09-30 12:54 ` [PATCH v6 6/7] xfs: Validate atomic writes John Garry
@ 2024-09-30 16:41   ` Darrick J. Wong
  2024-10-01 13:22     ` John Garry
  0 siblings, 1 reply; 29+ messages in thread
From: Darrick J. Wong @ 2024-09-30 16:41 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, brauner, viro, jack, dchinner, hch, cem, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, hare, martin.petersen,
	catherine.hoang, mcgrof, ritesh.list, ojaswin

On Mon, Sep 30, 2024 at 12:54:37PM +0000, John Garry wrote:
> Validate that an atomic write adheres to length/offset rules. Currently
> we can only write a single FS block.
> 
> For an IOCB with IOCB_ATOMIC set to get as far as xfs_file_dio_write(),
> FMODE_CAN_ATOMIC_WRITE will need to be set for the file; for this,
> ATOMICWRITES flags would also need to be set for the inode.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_file.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 412b1d71b52b..fa6a44b88ecc 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -688,6 +688,13 @@ xfs_file_dio_write(
>  	struct xfs_buftarg      *target = xfs_inode_buftarg(ip);
>  	size_t			count = iov_iter_count(from);
>  
> +	if (iocb->ki_flags & IOCB_ATOMIC) {
> +		if (count != ip->i_mount->m_sb.sb_blocksize)
> +			return -EINVAL;
> +		if (!generic_atomic_write_valid(iocb, from))
> +			return -EINVAL;
> +	}

Does xfs_file_write_iter need a catch-all so that we don't fall back to
buffered write for a directio write that returns ENOTBLK?

	if (iocb->ki_flags & IOCB_DIRECT) {
		/*
		 * Allow a directio write to fall back to a buffered
		 * write *only* in the case that we're doing a reflink
		 * CoW.  In all other directio scenarios we do not
		 * allow an operation to fall back to buffered mode.
		 */
		ret = xfs_file_dio_write(iocb, from);
		if (ret != -ENOTBLK || (iocb->ki_flags & IOCB_ATOMIC))
			return ret;
	}

IIRC iomap_dio_rw can return ENOTBLK if pagecache invalidation fails for
the region that we're trying to directio write.

--D

> +
>  	/* direct I/O must be aligned to device logical sector size */
>  	if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
>  		return -EINVAL;
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 7/7] xfs: Support setting FMODE_CAN_ATOMIC_WRITE
  2024-09-30 12:54 ` [PATCH v6 7/7] xfs: Support setting FMODE_CAN_ATOMIC_WRITE John Garry
@ 2024-09-30 16:41   ` Darrick J. Wong
  0 siblings, 0 replies; 29+ messages in thread
From: Darrick J. Wong @ 2024-09-30 16:41 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, brauner, viro, jack, dchinner, hch, cem, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, hare, martin.petersen,
	catherine.hoang, mcgrof, ritesh.list, ojaswin

On Mon, Sep 30, 2024 at 12:54:38PM +0000, John Garry wrote:
> For when an inode is enabled for atomic writes, set FMODE_CAN_ATOMIC_WRITE
> flag. Only direct IO is currently supported, so check for that also.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>

Looks good,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_file.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index fa6a44b88ecc..a358657a1ae6 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1208,6 +1208,16 @@ xfs_file_remap_range(
>  	return remapped > 0 ? remapped : ret;
>  }
>  
> +static bool
> +xfs_file_open_can_atomicwrite(
> +	struct inode		*inode,
> +	struct file		*file)
> +{
> +	if (!(file->f_flags & O_DIRECT))
> +		return false;
> +	return xfs_inode_can_atomicwrite(XFS_I(inode));
> +}
> +
>  STATIC int
>  xfs_file_open(
>  	struct inode	*inode,
> @@ -1216,6 +1226,8 @@ xfs_file_open(
>  	if (xfs_is_shutdown(XFS_M(inode->i_sb)))
>  		return -EIO;
>  	file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
> +	if (xfs_file_open_can_atomicwrite(inode, file))
> +		file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
>  	return generic_file_open(inode, file);
>  }
>  
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 4/7] xfs: Support FS_XFLAG_ATOMICWRITES
  2024-09-30 16:03   ` Darrick J. Wong
@ 2024-09-30 16:44     ` Darrick J. Wong
  2024-10-01  8:41     ` Christoph Hellwig
  2024-10-01 13:35     ` John Garry
  2 siblings, 0 replies; 29+ messages in thread
From: Darrick J. Wong @ 2024-09-30 16:44 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, brauner, viro, jack, dchinner, hch, cem, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, hare, martin.petersen,
	catherine.hoang, mcgrof, ritesh.list, ojaswin

On Mon, Sep 30, 2024 at 09:03:49AM -0700, Darrick J. Wong wrote:
> On Mon, Sep 30, 2024 at 12:54:35PM +0000, John Garry wrote:
> > Add initial support for new flag FS_XFLAG_ATOMICWRITES.
> > 
> > This flag is a file attribute that mirrors an ondisk inode flag.  Actual
> > support for untorn file writes (for now) depends on both the iflag and the
> > underlying storage devices, which we can only really check at statx and
> > pwritev2() time.  This is the same story as FS_XFLAG_DAX, which signals to
> > the fs that we should try to enable the fsdax IO path on the file (instead
> > of the regular page cache), but applications have to query STAT_ATTR_DAX to
> > find out if they really got that IO path.
> > 
> > Current kernel support for atomic writes is based on HW support (for atomic
> > writes). Since for regular files XFS has no way to specify extent alignment
> > or granularity, atomic write size is limited to the FS block size.
> > 
> > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_format.h     | 11 ++++++++--
> >  fs/xfs/libxfs/xfs_inode_buf.c  | 38 ++++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_inode_util.c |  6 ++++++
> >  fs/xfs/libxfs/xfs_sb.c         |  2 ++
> >  fs/xfs/xfs_buf.c               | 15 +++++++++++++-
> >  fs/xfs/xfs_buf.h               |  5 ++++-
> >  fs/xfs/xfs_buf_mem.c           |  2 +-
> >  fs/xfs/xfs_inode.h             |  5 +++++
> >  fs/xfs/xfs_ioctl.c             | 37 +++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_mount.h             |  2 ++
> >  fs/xfs/xfs_reflink.c           |  4 ++++
> >  fs/xfs/xfs_super.c             |  4 ++++
> >  include/uapi/linux/fs.h        |  1 +
> >  13 files changed, 127 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index e1bfee0c3b1a..ed5e5442f0d4 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -352,11 +352,15 @@ xfs_sb_has_compat_feature(
> >  #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
> >  #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
> >  #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
> > +#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 31)	/* atomicwrites enabled */
> > +
> >  #define XFS_SB_FEAT_RO_COMPAT_ALL \
> >  		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
> >  		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
> >  		 XFS_SB_FEAT_RO_COMPAT_REFLINK| \
> > -		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
> > +		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
> > +		 XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
> > +
> >  #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
> >  static inline bool
> >  xfs_sb_has_ro_compat_feature(
> > @@ -1093,16 +1097,19 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
> >  #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
> >  #define XFS_DIFLAG2_BIGTIME_BIT	3	/* big timestamps */
> >  #define XFS_DIFLAG2_NREXT64_BIT 4	/* large extent counters */
> > +#define XFS_DIFLAG2_ATOMICWRITES_BIT 5	/* atomic writes permitted */
> >  
> >  #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
> >  #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
> >  #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
> >  #define XFS_DIFLAG2_BIGTIME	(1 << XFS_DIFLAG2_BIGTIME_BIT)
> >  #define XFS_DIFLAG2_NREXT64	(1 << XFS_DIFLAG2_NREXT64_BIT)
> > +#define XFS_DIFLAG2_ATOMICWRITES	(1 << XFS_DIFLAG2_ATOMICWRITES_BIT)
> >  
> >  #define XFS_DIFLAG2_ANY \
> >  	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
> > -	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64)
> > +	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | \
> > +	 XFS_DIFLAG2_ATOMICWRITES)
> >  
> >  static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
> >  {
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index 79babeac9d75..1e852cdd1d6f 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -483,6 +483,36 @@ xfs_dinode_verify_nrext64(
> >  	return NULL;
> >  }
> >  
> > +static xfs_failaddr_t
> > +xfs_inode_validate_atomicwrites(
> > +	struct xfs_mount	*mp,
> > +	uint32_t		cowextsize,
> > +	uint16_t		mode,
> > +	int64_t			flags2)
> > +{
> > +	/* superblock rocompat feature flag */
> > +	if (!xfs_has_atomicwrites(mp))
> > +		return __this_address;
> > +
> > +	/* Only regular files and directories */
> > +	if (!S_ISREG(mode) && !(S_ISDIR(mode)))
> > +		return __this_address;
> > +
> > +	/* COW extsize disallowed */
> > +	if (flags2 & XFS_DIFLAG2_COWEXTSIZE)
> > +		return __this_address;
> > +
> > +	/* cowextsize must be zero */
> > +	if (cowextsize)
> > +		return __this_address;
> > +
> > +	/* reflink is disallowed */
> > +	if (flags2 & XFS_DIFLAG2_REFLINK)
> > +		return __this_address;
> 
> If we're only allowing atomic writes that are 1 fsblock or less, then
> copy on write will work correctly because CoWs are always done with
> fsblock granularity.  The ioend remap is also committed atomically.
> 
> IOWs, it's forcealign that isn't compatible with reflink and you can
> drop this incompatibility.

Also, why do you need to check cowextsize?  I thought cowextsize==0 is a
requirement of forcealign (because everything is keyed off extsize) and
is not something that atomicwrites itself requires?

Same applies to xfs_ioctl_setattr_atomicwrites.

--D

> > +
> > +	return NULL;
> > +}
> > +
> >  xfs_failaddr_t
> >  xfs_dinode_verify(
> >  	struct xfs_mount	*mp,
> > @@ -663,6 +693,14 @@ xfs_dinode_verify(
> >  	    !xfs_has_bigtime(mp))
> >  		return __this_address;
> >  
> > +	if (flags2 & XFS_DIFLAG2_ATOMICWRITES) {
> > +		fa = xfs_inode_validate_atomicwrites(mp,
> > +				be32_to_cpu(dip->di_cowextsize),
> 
> Technically speaking, the space used by di_cowextsize isn't defined on
> !reflink filesystems.  The contents are supposed to be zero, but nobody
> actually checks that, so you might want to special case this:
> 
> 		fa = xfs_inode_validate_atomicwrites(mp,
> 				xfs_has_reflink(mp) ?
> 					be32_to_cpu(dip->di_cowextsize) : 0,
> 				mode, flags2);
> 
> (inasmuch as this code is getting ugly and maybe you want to use a
> temporary variable)
> 
> > +				mode, flags2);
> > +		if (fa)
> > +			return fa;
> > +	}
> > +
> >  	return NULL;
> >  }
> >  
> > diff --git a/fs/xfs/libxfs/xfs_inode_util.c b/fs/xfs/libxfs/xfs_inode_util.c
> > index cc38e1c3c3e1..e59e98783bf7 100644
> > --- a/fs/xfs/libxfs/xfs_inode_util.c
> > +++ b/fs/xfs/libxfs/xfs_inode_util.c
> > @@ -80,6 +80,8 @@ xfs_flags2diflags2(
> >  		di_flags2 |= XFS_DIFLAG2_DAX;
> >  	if (xflags & FS_XFLAG_COWEXTSIZE)
> >  		di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
> > +	if (xflags & FS_XFLAG_ATOMICWRITES)
> > +		di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;
> >  
> >  	return di_flags2;
> >  }
> > @@ -126,6 +128,8 @@ xfs_ip2xflags(
> >  			flags |= FS_XFLAG_DAX;
> >  		if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
> >  			flags |= FS_XFLAG_COWEXTSIZE;
> > +		if (ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
> > +			flags |= FS_XFLAG_ATOMICWRITES;
> >  	}
> >  
> >  	if (xfs_inode_has_attr_fork(ip))
> > @@ -224,6 +228,8 @@ xfs_inode_inherit_flags2(
> >  	}
> >  	if (pip->i_diflags2 & XFS_DIFLAG2_DAX)
> >  		ip->i_diflags2 |= XFS_DIFLAG2_DAX;
> > +	if (pip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
> > +		ip->i_diflags2 |= XFS_DIFLAG2_ATOMICWRITES;
> >  
> >  	/* Don't let invalid cowextsize hints propagate. */
> >  	failaddr = xfs_inode_validate_cowextsize(ip->i_mount, ip->i_cowextsize,
> > diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> > index d95409f3cba6..dd819561d0a5 100644
> > --- a/fs/xfs/libxfs/xfs_sb.c
> > +++ b/fs/xfs/libxfs/xfs_sb.c
> > @@ -164,6 +164,8 @@ xfs_sb_version_to_features(
> >  		features |= XFS_FEAT_REFLINK;
> >  	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
> >  		features |= XFS_FEAT_INOBTCNT;
> > +	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
> > +		features |= XFS_FEAT_ATOMICWRITES;
> >  	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
> >  		features |= XFS_FEAT_FTYPE;
> >  	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index aa4dbda7b536..44bee3e2b2bb 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -2060,6 +2060,8 @@ int
> >  xfs_init_buftarg(
> >  	struct xfs_buftarg		*btp,
> >  	size_t				logical_sectorsize,
> > +	unsigned int			awu_min,
> > +	unsigned int			awu_max,
> >  	const char			*descr)
> >  {
> >  	/* Set up device logical sector size mask */
> > @@ -2086,6 +2088,9 @@ xfs_init_buftarg(
> >  	btp->bt_shrinker->scan_objects = xfs_buftarg_shrink_scan;
> >  	btp->bt_shrinker->private_data = btp;
> >  	shrinker_register(btp->bt_shrinker);
> > +
> > +	btp->bt_bdev_awu_min = awu_min;
> > +	btp->bt_bdev_awu_max = awu_max;
> >  	return 0;
> >  
> >  out_destroy_io_count:
> > @@ -2102,6 +2107,7 @@ xfs_alloc_buftarg(
> >  {
> >  	struct xfs_buftarg	*btp;
> >  	const struct dax_holder_operations *ops = NULL;
> > +	unsigned int awu_min = 0, awu_max = 0;
> >  
> >  #if defined(CONFIG_FS_DAX) && defined(CONFIG_MEMORY_FAILURE)
> >  	ops = &xfs_dax_holder_operations;
> > @@ -2115,6 +2121,13 @@ xfs_alloc_buftarg(
> >  	btp->bt_daxdev = fs_dax_get_by_bdev(btp->bt_bdev, &btp->bt_dax_part_off,
> >  					    mp, ops);
> >  
> > +	if (bdev_can_atomic_write(btp->bt_bdev)) {
> > +		struct request_queue *q = bdev_get_queue(btp->bt_bdev);
> > +
> > +		awu_min = queue_atomic_write_unit_min_bytes(q);
> > +		awu_max = queue_atomic_write_unit_max_bytes(q);
> > +	}
> > +
> >  	/*
> >  	 * When allocating the buftargs we have not yet read the super block and
> >  	 * thus don't know the file system sector size yet.
> > @@ -2122,7 +2135,7 @@ xfs_alloc_buftarg(
> >  	if (xfs_setsize_buftarg(btp, bdev_logical_block_size(btp->bt_bdev)))
> >  		goto error_free;
> >  	if (xfs_init_buftarg(btp, bdev_logical_block_size(btp->bt_bdev),
> > -			mp->m_super->s_id))
> > +			awu_min, awu_max, mp->m_super->s_id))
> >  		goto error_free;
> 
> Rather than passing this into the constructor and making the xmbuf code
> pass zeroes, why not set the awu values here in xfs_alloc_buftarg just
> before returning btp?
> 
> 	if (bdev_can_atomic_write(btp->bt_bdev)) {
> 		struct request_queue *q = bdev_get_queue(btp->bt_bdev);
> 
> 		btp->bt_bdev_awu_min = queue_atomic_write_unit_min_bytes(q);
> 		btp->bt_bdev_awu_max = queue_atomic_write_unit_max_bytes(q);
> 	}
> 
> 
> --D
> 
> >  	return btp;
> > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > index 209a389f2abc..b813cb60a8f3 100644
> > --- a/fs/xfs/xfs_buf.h
> > +++ b/fs/xfs/xfs_buf.h
> > @@ -124,6 +124,9 @@ struct xfs_buftarg {
> >  	struct percpu_counter	bt_io_count;
> >  	struct ratelimit_state	bt_ioerror_rl;
> >  
> > +	/* Atomic write unit values */
> > +	unsigned int		bt_bdev_awu_min, bt_bdev_awu_max;
> > +
> >  	/* built-in cache, if we're not using the perag one */
> >  	struct xfs_buf_cache	bt_cache[];
> >  };
> > @@ -393,7 +396,7 @@ bool xfs_verify_magic16(struct xfs_buf *bp, __be16 dmagic);
> >  
> >  /* for xfs_buf_mem.c only: */
> >  int xfs_init_buftarg(struct xfs_buftarg *btp, size_t logical_sectorsize,
> > -		const char *descr);
> > +		unsigned int awu_min, unsigned int awu_max, const char *descr);
> >  void xfs_destroy_buftarg(struct xfs_buftarg *btp);
> >  
> >  #endif	/* __XFS_BUF_H__ */
> > diff --git a/fs/xfs/xfs_buf_mem.c b/fs/xfs/xfs_buf_mem.c
> > index 07bebbfb16ee..722d75f89767 100644
> > --- a/fs/xfs/xfs_buf_mem.c
> > +++ b/fs/xfs/xfs_buf_mem.c
> > @@ -93,7 +93,7 @@ xmbuf_alloc(
> >  	btp->bt_meta_sectorsize = XMBUF_BLOCKSIZE;
> >  	btp->bt_meta_sectormask = XMBUF_BLOCKSIZE - 1;
> >  
> > -	error = xfs_init_buftarg(btp, XMBUF_BLOCKSIZE, descr);
> > +	error = xfs_init_buftarg(btp, XMBUF_BLOCKSIZE, 0, 0, descr);
> >  	if (error)
> >  		goto out_bcache;
> >  
> > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > index 97ed912306fd..1c62ee294a5a 100644
> > --- a/fs/xfs/xfs_inode.h
> > +++ b/fs/xfs/xfs_inode.h
> > @@ -327,6 +327,11 @@ static inline bool xfs_inode_has_bigrtalloc(struct xfs_inode *ip)
> >  	(XFS_IS_REALTIME_INODE(ip) ? \
> >  		(ip)->i_mount->m_rtdev_targp : (ip)->i_mount->m_ddev_targp)
> >  
> > +static inline bool xfs_inode_has_atomicwrites(struct xfs_inode *ip)
> > +{
> > +	return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
> > +}
> > +
> >  /*
> >   * In-core inode flags.
> >   */
> > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > index a20d426ef021..81872c32dcb2 100644
> > --- a/fs/xfs/xfs_ioctl.c
> > +++ b/fs/xfs/xfs_ioctl.c
> > @@ -469,6 +469,36 @@ xfs_fileattr_get(
> >  	return 0;
> >  }
> >  
> > +static int
> > +xfs_ioctl_setattr_atomicwrites(
> > +	struct xfs_inode	*ip,
> > +	struct fileattr		*fa)
> > +{
> > +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	struct xfs_sb		*sbp = &mp->m_sb;
> > +
> > +	if (!xfs_has_atomicwrites(mp))
> > +		return -EINVAL;
> > +
> > +	if (target->bt_bdev_awu_min > sbp->sb_blocksize)
> > +		return -EINVAL;
> > +
> > +	if (target->bt_bdev_awu_max < sbp->sb_blocksize)
> > +		return -EINVAL;
> > +
> > +	if (xfs_is_reflink_inode(ip))
> > +		return -EINVAL;
> > +
> > +	if (fa->fsx_xflags & FS_XFLAG_COWEXTSIZE)
> > +		return -EINVAL;
> > +
> > +	if (fa->fsx_cowextsize)
> > +		return -EINVAL;
> > +
> > +	return 0;
> > +}
> > +
> >  static int
> >  xfs_ioctl_setattr_xflags(
> >  	struct xfs_trans	*tp,
> > @@ -478,6 +508,7 @@ xfs_ioctl_setattr_xflags(
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  	bool			rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
> >  	uint64_t		i_flags2;
> > +	int			error;
> >  
> >  	if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
> >  		/* Can't change realtime flag if any extents are allocated. */
> > @@ -512,6 +543,12 @@ xfs_ioctl_setattr_xflags(
> >  	if (i_flags2 && !xfs_has_v3inodes(mp))
> >  		return -EINVAL;
> >  
> > +	if (fa->fsx_xflags & FS_XFLAG_ATOMICWRITES) {
> > +		error = xfs_ioctl_setattr_atomicwrites(ip, fa);
> > +		if (error)
> > +			return error;
> > +	}
> > +
> >  	ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
> >  	ip->i_diflags2 = i_flags2;
> >  
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index 96496f39f551..6ac6518a2ef3 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -298,6 +298,7 @@ typedef struct xfs_mount {
> >  #define XFS_FEAT_NEEDSREPAIR	(1ULL << 25)	/* needs xfs_repair */
> >  #define XFS_FEAT_NREXT64	(1ULL << 26)	/* large extent counters */
> >  #define XFS_FEAT_EXCHANGE_RANGE	(1ULL << 27)	/* exchange range */
> > +#define XFS_FEAT_ATOMICWRITES	(1ULL << 28)	/* atomic writes support */
> >  
> >  /* Mount features */
> >  #define XFS_FEAT_NOATTR2	(1ULL << 48)	/* disable attr2 creation */
> > @@ -384,6 +385,7 @@ __XFS_ADD_V4_FEAT(projid32, PROJID32)
> >  __XFS_HAS_V4_FEAT(v3inodes, V3INODES)
> >  __XFS_HAS_V4_FEAT(crc, CRC)
> >  __XFS_HAS_V4_FEAT(pquotino, PQUOTINO)
> > +__XFS_HAS_FEAT(atomicwrites, ATOMICWRITES)
> >  
> >  /*
> >   * Mount features
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 6fde6ec8092f..6679b12a56c9 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -1471,6 +1471,10 @@ xfs_reflink_remap_prep(
> >  	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
> >  		goto out_unlock;
> >  
> > +	/* Don't reflink atomic write inodes */
> > +	if (xfs_inode_has_atomicwrites(src) || xfs_inode_has_atomicwrites(dest))
> > +		goto out_unlock;
> > +
> >  	/* Don't share DAX file data with non-DAX file. */
> >  	if (IS_DAX(inode_in) != IS_DAX(inode_out))
> >  		goto out_unlock;
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index fbb3a1594c0d..97c1d9493cdb 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1733,6 +1733,10 @@ xfs_fs_fill_super(
> >  		mp->m_features &= ~XFS_FEAT_DISCARD;
> >  	}
> >  
> > +	if (xfs_has_atomicwrites(mp))
> > +		xfs_warn(mp,
> > +	"EXPERIMENTAL atomicwrites feature in use. Use at your own risk!");
> > +
> >  	if (xfs_has_reflink(mp)) {
> >  		if (mp->m_sb.sb_rblocks) {
> >  			xfs_alert(mp,
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index 753971770733..e813217e0fe4 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -158,6 +158,7 @@ struct fsxattr {
> >  #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
> >  #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
> >  #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
> > +#define FS_XFLAG_ATOMICWRITES	0x00020000	/* atomic writes enabled */
> >  #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
> >  
> >  /* the read-only stuff doesn't really belong here, but any other place is
> > -- 
> > 2.31.1
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 3/7] fs: iomap: Atomic write support
  2024-09-30 15:55   ` Darrick J. Wong
@ 2024-10-01  8:05     ` John Garry
  2024-10-01 14:37       ` Darrick J. Wong
  0 siblings, 1 reply; 29+ messages in thread
From: John Garry @ 2024-10-01  8:05 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, brauner, viro, jack, dchinner, hch, cem, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, hare, martin.petersen,
	catherine.hoang, mcgrof, ritesh.list, ojaswin


> 
> This new flag needs a documentation update.  What do you think of this?
> 
> diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
> index 8e6c721d23301..279db993be7fa 100644
> --- a/Documentation/filesystems/iomap/operations.rst
> +++ b/Documentation/filesystems/iomap/operations.rst
> @@ -513,6 +513,16 @@ IOMAP_WRITE`` with any combination of the following enhancements:
>      if the mapping is unwritten and the filesystem cannot handle zeroing
>      the unaligned regions without exposing stale contents.
>   
> + * ``IOMAP_ATOMIC``: This write must be persisted in its entirety or
> +   not at all.
> +   The write must not be split into multiple I/O requests.
> +   The file range to write must be aligned to satisfy the requirements
> +   of both the filesystem and the underlying block device's atomic
> +   commit capabilities.
> +   If filesystem metadata updates are required (e.g. unwritten extent
> +   conversion or copy on write), all updates for the entire file range
> +   must be committed atomically as well.
> +
>   Callers commonly hold ``i_rwsem`` in shared or exclusive mode before
>   calling this function.
>   

Sure, but I would make a couple of tweaks to the beginning:

  * ``IOMAP_ATOMIC``: This write is to be be issued with torn-write
    protection. Only a single bio can be created for the write, and the
    bio must not be split into multiple I/O requests, i.e. flag
    REQ_ATOMIC must be set.
    The file range to write must be aligned to satisfy the requirements
    of both the filesystem and the underlying block device's atomic
    commit capabilities.
    If filesystem metadata updates are required (e.g. unwritten extent
    conversion or copy on write), all updates for the entire file range
    must be committed atomically as well.

ok?

Thanks,
John



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 5/7] xfs: Support atomic write for statx
  2024-09-30 16:37   ` Darrick J. Wong
@ 2024-10-01  8:29     ` John Garry
  2024-10-01  8:43     ` Christoph Hellwig
  1 sibling, 0 replies; 29+ messages in thread
From: John Garry @ 2024-10-01  8:29 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, brauner, viro, jack, dchinner, hch, cem, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, hare, martin.petersen,
	catherine.hoang, mcgrof, ritesh.list, ojaswin

On 30/09/2024 17:37, Darrick J. Wong wrote:
> On Mon, Sep 30, 2024 at 12:54:36PM +0000, John Garry wrote:
>> Support providing info on atomic write unit min and max for an inode.
>>
>> For simplicity, currently we limit the min at the FS block size. As for
>> max, we limit also at FS block size, as there is no current method to
>> guarantee extent alignment or granularity for regular files.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/xfs_inode.h | 17 +++++++++++++++++
>>   fs/xfs/xfs_iops.c  | 24 ++++++++++++++++++++++++
>>   2 files changed, 41 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
>> index 1c62ee294a5a..1ea73402d592 100644
>> --- a/fs/xfs/xfs_inode.h
>> +++ b/fs/xfs/xfs_inode.h
>> @@ -332,6 +332,23 @@ static inline bool xfs_inode_has_atomicwrites(struct xfs_inode *ip)
>>   	return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
>>   }
>>   
>> +static inline bool
>> +xfs_inode_can_atomicwrite(
>> +	struct xfs_inode	*ip)
>> +{
>> +	struct xfs_mount	*mp = ip->i_mount;
>> +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
>> +
>> +	if (!xfs_inode_has_atomicwrites(ip))
>> +		return false;
>> +	if (mp->m_sb.sb_blocksize < target->bt_bdev_awu_min)
>> +		return false;
>> +	if (mp->m_sb.sb_blocksize > target->bt_bdev_awu_max)
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>>   /*
>>    * In-core inode flags.
>>    */
>> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
>> index ee79cf161312..915d057db9bb 100644
>> --- a/fs/xfs/xfs_iops.c
>> +++ b/fs/xfs/xfs_iops.c
>> @@ -570,6 +570,23 @@ xfs_stat_blksize(
>>   	return max_t(uint32_t, PAGE_SIZE, mp->m_sb.sb_blocksize);
>>   }
>>   
>> +static void
>> +xfs_get_atomic_write_attr(
>> +	struct xfs_inode	*ip,
>> +	unsigned int		*unit_min,
>> +	unsigned int		*unit_max)
>> +{
>> +	struct xfs_mount	*mp = ip->i_mount;
>> +	struct xfs_sb		*sbp = &mp->m_sb;
>> +
>> +	if (!xfs_inode_can_atomicwrite(ip)) {
>> +		*unit_min = *unit_max = 0;
>> +		return;
>> +	}
>> +
>> +	*unit_min = *unit_max = sbp->sb_blocksize;
> 
> Ok, so we're only supporting untorn writes if they're exactly the fs
> blocksize, and 1 fsblock is between awu_min/max.  That simplifies a lot
> of things. :)
> 
> Not supporting sub-fsblock atomic writes means that we'll never hit the
> directio COW fallback code, which uses the pagecache.

My original idea (with forcealign) was to support 1FSB and larger.

I suppose that with a larger FS block size we might want to support 
sub-fsblock atomic writes. However, for the moment, I don't see a need 
to support this.

> 
> Not supporting multi-fsblock atomic writes means that you don't have to
> figure out how to ensure that we always do cow on forcealign
> granularity.  Though as I pointed out elsewhere in this thread, that's a
> forcealign problem.

Sure

> 
> Yay! ;)
> 
>> +}
>> +
>>   STATIC int
>>   xfs_vn_getattr(
>>   	struct mnt_idmap	*idmap,
>> @@ -643,6 +660,13 @@ xfs_vn_getattr(
>>   			stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
>>   			stat->dio_offset_align = bdev_logical_block_size(bdev);
>>   		}
>> +		if (request_mask & STATX_WRITE_ATOMIC) {
>> +			unsigned int unit_min, unit_max;
>> +
>> +			xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
>> +			generic_fill_statx_atomic_writes(stat,
>> +				unit_min, unit_max);
> 
> Consistent indenting and wrapping, please:

ok

> 
> 			xfs_get_atomic_write_attr(ip, &unit_min,
> 					&unit_max);
> 			generic_fill_statx_atomic_writes(stat,
> 					unit_min, unit_max);
> 
> 
> With that fixed,
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

ok, thanks!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 1/7] block/fs: Pass an iocb to generic_atomic_write_valid()
  2024-09-30 12:54 ` [PATCH v6 1/7] block/fs: Pass an iocb to generic_atomic_write_valid() John Garry
@ 2024-10-01  8:39   ` Christoph Hellwig
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2024-10-01  8:39 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, brauner, djwong, viro, jack, dchinner, hch, cem,
	linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
	martin.petersen, catherine.hoang, mcgrof, ritesh.list, ojaswin

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 4/7] xfs: Support FS_XFLAG_ATOMICWRITES
  2024-09-30 16:03   ` Darrick J. Wong
  2024-09-30 16:44     ` Darrick J. Wong
@ 2024-10-01  8:41     ` Christoph Hellwig
  2024-10-01 12:05       ` John Garry
  2024-10-01 13:35     ` John Garry
  2 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2024-10-01  8:41 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Garry, axboe, brauner, viro, jack, dchinner, hch, cem,
	linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
	martin.petersen, catherine.hoang, mcgrof, ritesh.list, ojaswin

On Mon, Sep 30, 2024 at 09:03:49AM -0700, Darrick J. Wong wrote:
> If we're only allowing atomic writes that are 1 fsblock or less, then
> copy on write will work correctly because CoWs are always done with
> fsblock granularity.  The ioend remap is also committed atomically.
> 
> IOWs, it's forcealign that isn't compatible with reflink and you can
> drop this incompatibility.

That was my thought as well when reading through this patch.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 2/7] fs: Export generic_atomic_write_valid()
  2024-09-30 12:54 ` [PATCH v6 2/7] fs: Export generic_atomic_write_valid() John Garry
@ 2024-10-01  8:41   ` Christoph Hellwig
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2024-10-01  8:41 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, brauner, djwong, viro, jack, dchinner, hch, cem,
	linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
	martin.petersen, catherine.hoang, mcgrof, ritesh.list, ojaswin

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 5/7] xfs: Support atomic write for statx
  2024-09-30 16:37   ` Darrick J. Wong
  2024-10-01  8:29     ` John Garry
@ 2024-10-01  8:43     ` Christoph Hellwig
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2024-10-01  8:43 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Garry, axboe, brauner, viro, jack, dchinner, hch, cem,
	linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
	martin.petersen, catherine.hoang, mcgrof, ritesh.list, ojaswin

On Mon, Sep 30, 2024 at 09:37:16AM -0700, Darrick J. Wong wrote:
> Ok, so we're only supporting untorn writes if they're exactly the fs
> blocksize, and 1 fsblock is between awu_min/max.  That simplifies a lot
> of things. :)
> 
> Not supporting sub-fsblock atomic writes means that we'll never hit the
> directio COW fallback code, which uses the pagecache.
> 
> Not supporting multi-fsblock atomic writes means that you don't have to
> figure out how to ensure that we always do cow on forcealign
> granularity.  Though as I pointed out elsewhere in this thread, that's a
> forcealign problem.

It does simplify things a lot, and is probably a good idea for
the initial version.  But I suspect support for atomic writes
smaller than the block size will be really useful and we should
eventually support them, but maybe now on reflinked files or
alwayscow.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 4/7] xfs: Support FS_XFLAG_ATOMICWRITES
  2024-10-01  8:41     ` Christoph Hellwig
@ 2024-10-01 12:05       ` John Garry
  0 siblings, 0 replies; 29+ messages in thread
From: John Garry @ 2024-10-01 12:05 UTC (permalink / raw)
  To: Christoph Hellwig, Darrick J. Wong
  Cc: axboe, brauner, viro, jack, dchinner, cem, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, hare, martin.petersen,
	catherine.hoang, mcgrof, ritesh.list, ojaswin

On 01/10/2024 09:41, Christoph Hellwig wrote:
> On Mon, Sep 30, 2024 at 09:03:49AM -0700, Darrick J. Wong wrote:
>> If we're only allowing atomic writes that are 1 fsblock or less, then
>> copy on write will work correctly because CoWs are always done with
>> fsblock granularity.  The ioend remap is also committed atomically.
>>
>> IOWs, it's forcealign that isn't compatible with reflink and you can
>> drop this incompatibility.
> That was my thought as well when reading through this patch.

ok, fine. I just thought that we don't want to mix reflinked files and 
atomic writes at all. I do understand that there was also an 
incompatibility between forcealign and reflink.

Cheers,
John

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 6/7] xfs: Validate atomic writes
  2024-09-30 16:41   ` Darrick J. Wong
@ 2024-10-01 13:22     ` John Garry
  2024-10-01 14:48       ` Darrick J. Wong
  0 siblings, 1 reply; 29+ messages in thread
From: John Garry @ 2024-10-01 13:22 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, brauner, viro, jack, dchinner, hch, cem, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, hare, martin.petersen,
	catherine.hoang, mcgrof, ritesh.list, ojaswin

On 30/09/2024 17:41, Darrick J. Wong wrote:
> On Mon, Sep 30, 2024 at 12:54:37PM +0000, John Garry wrote:
>> Validate that an atomic write adheres to length/offset rules. Currently
>> we can only write a single FS block.
>>
>> For an IOCB with IOCB_ATOMIC set to get as far as xfs_file_dio_write(),
>> FMODE_CAN_ATOMIC_WRITE will need to be set for the file; for this,
>> ATOMICWRITES flags would also need to be set for the inode.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/xfs_file.c | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index 412b1d71b52b..fa6a44b88ecc 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -688,6 +688,13 @@ xfs_file_dio_write(
>>   	struct xfs_buftarg      *target = xfs_inode_buftarg(ip);
>>   	size_t			count = iov_iter_count(from);
>>   
>> +	if (iocb->ki_flags & IOCB_ATOMIC) {
>> +		if (count != ip->i_mount->m_sb.sb_blocksize)
>> +			return -EINVAL;
>> +		if (!generic_atomic_write_valid(iocb, from))
>> +			return -EINVAL;
>> +	}
> 
> Does xfs_file_write_iter need a catch-all so that we don't fall back to
> buffered write for a directio write that returns ENOTBLK?
> 
> 	if (iocb->ki_flags & IOCB_DIRECT) {
> 		/*
> 		 * Allow a directio write to fall back to a buffered
> 		 * write *only* in the case that we're doing a reflink
> 		 * CoW.  In all other directio scenarios we do not
> 		 * allow an operation to fall back to buffered mode.
> 		 */
> 		ret = xfs_file_dio_write(iocb, from);
> 		if (ret != -ENOTBLK || (iocb->ki_flags & IOCB_ATOMIC))
> 			return ret;
> 	}
> 
> IIRC iomap_dio_rw can return ENOTBLK if pagecache invalidation fails for
> the region that we're trying to directio write.

I see where you are talking about. There is also a ENOTBLK from 
unaligned write for CoW, but we would not see that.

But I was thinking to use a common helper to catch this, like 
generic_write_checks_count() [which is called on the buffered IO path]:

----8<-----

diff --git a/fs/read_write.c b/fs/read_write.c
index 32b476bf9be0..222f25c6439c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1774,6 +1774,10 @@ int generic_write_checks_count(struct kiocb 
*iocb, loff_t *count)
  	if (!*count)
  		return 0;

+	if (iocb->ki_flags & IOCB_ATOMIC &&
+	    !(iocb->ki_flags & IOCB_DIRECT))
+		return -EINVAL;
+
  	if (iocb->ki_flags & IOCB_APPEND)
  		iocb->ki_pos = i_size_read(inode);

---->8-----

But we keep the IOCB_DIRECT flag for the buffered IO fallback (so no good).

Another option would be:

----8<-----

--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -679,7 +679,12 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter 
*iter,
  			if (ret != -EAGAIN) {
  				trace_iomap_dio_invalidate_fail(inode, iomi.pos,
  								iomi.len);
-				ret = -ENOTBLK;
+				if (iocb->ki_flags & IOCB_ATOMIC) {
+					if (ret == -ENOTBLK)
+						ret = -EAGAIN;
+				}else {
+					ret = -ENOTBLK;
+				}
  			}
  			goto out_free_dio;
  		}
---->8-----

I suggest that, as other FSes (like ext4) handle -ENOTBLK and would need 
to be changed similar to XFS. But I am not sure if changing the error 
code from -ENOTBLK for IOCB_ATOMIC is ok.

Let me know what you think about possible alternative solutions.

Thanks,
John

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 4/7] xfs: Support FS_XFLAG_ATOMICWRITES
  2024-09-30 16:03   ` Darrick J. Wong
  2024-09-30 16:44     ` Darrick J. Wong
  2024-10-01  8:41     ` Christoph Hellwig
@ 2024-10-01 13:35     ` John Garry
  2 siblings, 0 replies; 29+ messages in thread
From: John Garry @ 2024-10-01 13:35 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, brauner, viro, jack, dchinner, hch, cem, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, hare, martin.petersen,
	catherine.hoang, mcgrof, ritesh.list, ojaswin


>>   }
>>   
>> +static xfs_failaddr_t
>> +xfs_inode_validate_atomicwrites(
>> +	struct xfs_mount	*mp,
>> +	uint32_t		cowextsize,
>> +	uint16_t		mode,
>> +	int64_t			flags2)
>> +{
>> +	/* superblock rocompat feature flag */
>> +	if (!xfs_has_atomicwrites(mp))
>> +		return __this_address;
>> +
>> +	/* Only regular files and directories */
>> +	if (!S_ISREG(mode) && !(S_ISDIR(mode)))
>> +		return __this_address;
>> +
>> +	/* COW extsize disallowed */
>> +	if (flags2 & XFS_DIFLAG2_COWEXTSIZE)
>> +		return __this_address;
>> +
>> +	/* cowextsize must be zero */
>> +	if (cowextsize)
>> +		return __this_address;
>> +
>> +	/* reflink is disallowed */
>> +	if (flags2 & XFS_DIFLAG2_REFLINK)
>> +		return __this_address;
> 
> If we're only allowing atomic writes that are 1 fsblock or less, then
> copy on write will work correctly because CoWs are always done with
> fsblock granularity.  The ioend remap is also committed atomically.
> 
> IOWs, it's forcealign that isn't compatible with reflink and you can
> drop this incompatibility.

ok, understood

> 
>> +
>> +	return NULL;
>> +}
>> +
>>   xfs_failaddr_t
>>   xfs_dinode_verify(
>>   	struct xfs_mount	*mp,
>> @@ -663,6 +693,14 @@ xfs_dinode_verify(
>>   	    !xfs_has_bigtime(mp))
>>   		return __this_address;
>>   
>> +	if (flags2 & XFS_DIFLAG2_ATOMICWRITES) {
>> +		fa = xfs_inode_validate_atomicwrites(mp,
>> +				be32_to_cpu(dip->di_cowextsize),
> 
> Technically speaking, the space used by di_cowextsize isn't defined on
> !reflink filesystems.  The contents are supposed to be zero, but nobody
> actually checks that, so you might want to special case this:
> 
> 		fa = xfs_inode_validate_atomicwrites(mp,
> 				xfs_has_reflink(mp) ?
> 					be32_to_cpu(dip->di_cowextsize) : 0,
> 				mode, flags2);
> 
> (inasmuch as this code is getting ugly and maybe you want to use a
> temporary variable)

As discussed later, I will drop the cowextsize check (so need to pass 
this at all)

> 
>> +				mode, flags2);
>> +		if (fa)
>> +			return fa;
>> +	}
>> +
>>   	return NULL;
>>   }
>>   
>> diff --git a/fs/xfs/libxfs/xfs_inode_util.c b/fs/xfs/libxfs/xfs_inode_util.c
>> index cc38e1c3c3e1..e59e98783bf7 100644
>> --- a/fs/xfs/libxfs/xfs_inode_util.c
>> +++ b/fs/xfs/libxfs/xfs_inode_util.c
>> @@ -80,6 +80,8 @@ xfs_flags2diflags2(
>>   		di_flags2 |= XFS_DIFLAG2_DAX;
>>   	if (xflags & FS_XFLAG_COWEXTSIZE)
>>   		di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
>> +	if (xflags & FS_XFLAG_ATOMICWRITES)
>> +		di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;
>>   
>>   	return di_flags2;
>>   }
>> @@ -126,6 +128,8 @@ xfs_ip2xflags(
>>   			flags |= FS_XFLAG_DAX;
>>   		if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
>>   			flags |= FS_XFLAG_COWEXTSIZE;
>> +		if (ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
>> +			flags |= FS_XFLAG_ATOMICWRITES;
>>   	}
>>   
>>   	if (xfs_inode_has_attr_fork(ip))
>> @@ -224,6 +228,8 @@ xfs_inode_inherit_flags2(
>>   	}
>>   	if (pip->i_diflags2 & XFS_DIFLAG2_DAX)
>>   		ip->i_diflags2 |= XFS_DIFLAG2_DAX;
>> +	if (pip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
>> +		ip->i_diflags2 |= XFS_DIFLAG2_ATOMICWRITES;
>>   
>>   	/* Don't let invalid cowextsize hints propagate. */
>>   	failaddr = xfs_inode_validate_cowextsize(ip->i_mount, ip->i_cowextsize,
>> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
>> index d95409f3cba6..dd819561d0a5 100644
>> --- a/fs/xfs/libxfs/xfs_sb.c
>> +++ b/fs/xfs/libxfs/xfs_sb.c
>> @@ -164,6 +164,8 @@ xfs_sb_version_to_features(
>>   		features |= XFS_FEAT_REFLINK;
>>   	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
>>   		features |= XFS_FEAT_INOBTCNT;
>> +	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
>> +		features |= XFS_FEAT_ATOMICWRITES;
>>   	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
>>   		features |= XFS_FEAT_FTYPE;
>>   	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
>> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
>> index aa4dbda7b536..44bee3e2b2bb 100644
>> --- a/fs/xfs/xfs_buf.c
>> +++ b/fs/xfs/xfs_buf.c
>> @@ -2060,6 +2060,8 @@ int
>>   xfs_init_buftarg(
>>   	struct xfs_buftarg		*btp,
>>   	size_t				logical_sectorsize,
>> +	unsigned int			awu_min,
>> +	unsigned int			awu_max,
>>   	const char			*descr)
>>   {
>>   	/* Set up device logical sector size mask */
>> @@ -2086,6 +2088,9 @@ xfs_init_buftarg(
>>   	btp->bt_shrinker->scan_objects = xfs_buftarg_shrink_scan;
>>   	btp->bt_shrinker->private_data = btp;
>>   	shrinker_register(btp->bt_shrinker);
>> +
>> +	btp->bt_bdev_awu_min = awu_min;
>> +	btp->bt_bdev_awu_max = awu_max;
>>   	return 0;
>>   
>>   out_destroy_io_count:
>> @@ -2102,6 +2107,7 @@ xfs_alloc_buftarg(
>>   {
>>   	struct xfs_buftarg	*btp;
>>   	const struct dax_holder_operations *ops = NULL;
>> +	unsigned int awu_min = 0, awu_max = 0;
>>   
>>   #if defined(CONFIG_FS_DAX) && defined(CONFIG_MEMORY_FAILURE)
>>   	ops = &xfs_dax_holder_operations;
>> @@ -2115,6 +2121,13 @@ xfs_alloc_buftarg(
>>   	btp->bt_daxdev = fs_dax_get_by_bdev(btp->bt_bdev, &btp->bt_dax_part_off,
>>   					    mp, ops);
>>   
>> +	if (bdev_can_atomic_write(btp->bt_bdev)) {
>> +		struct request_queue *q = bdev_get_queue(btp->bt_bdev);
>> +
>> +		awu_min = queue_atomic_write_unit_min_bytes(q);
>> +		awu_max = queue_atomic_write_unit_max_bytes(q);
>> +	}
>> +
>>   	/*
>>   	 * When allocating the buftargs we have not yet read the super block and
>>   	 * thus don't know the file system sector size yet.
>> @@ -2122,7 +2135,7 @@ xfs_alloc_buftarg(
>>   	if (xfs_setsize_buftarg(btp, bdev_logical_block_size(btp->bt_bdev)))
>>   		goto error_free;
>>   	if (xfs_init_buftarg(btp, bdev_logical_block_size(btp->bt_bdev),
>> -			mp->m_super->s_id))
>> +			awu_min, awu_max, mp->m_super->s_id))
>>   		goto error_free;
> 
> Rather than passing this into the constructor and making the xmbuf code
> pass zeroes, why not set the awu values here in xfs_alloc_buftarg just
> before returning btp?

ok, I can do that instead.

Thanks,
John


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 3/7] fs: iomap: Atomic write support
  2024-10-01  8:05     ` John Garry
@ 2024-10-01 14:37       ` Darrick J. Wong
  0 siblings, 0 replies; 29+ messages in thread
From: Darrick J. Wong @ 2024-10-01 14:37 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, brauner, viro, jack, dchinner, hch, cem, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, hare, martin.petersen,
	catherine.hoang, mcgrof, ritesh.list, ojaswin

On Tue, Oct 01, 2024 at 09:05:17AM +0100, John Garry wrote:
> 
> > 
> > This new flag needs a documentation update.  What do you think of this?
> > 
> > diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
> > index 8e6c721d23301..279db993be7fa 100644
> > --- a/Documentation/filesystems/iomap/operations.rst
> > +++ b/Documentation/filesystems/iomap/operations.rst
> > @@ -513,6 +513,16 @@ IOMAP_WRITE`` with any combination of the following enhancements:
> >      if the mapping is unwritten and the filesystem cannot handle zeroing
> >      the unaligned regions without exposing stale contents.
> > + * ``IOMAP_ATOMIC``: This write must be persisted in its entirety or
> > +   not at all.
> > +   The write must not be split into multiple I/O requests.
> > +   The file range to write must be aligned to satisfy the requirements
> > +   of both the filesystem and the underlying block device's atomic
> > +   commit capabilities.
> > +   If filesystem metadata updates are required (e.g. unwritten extent
> > +   conversion or copy on write), all updates for the entire file range
> > +   must be committed atomically as well.
> > +
> >   Callers commonly hold ``i_rwsem`` in shared or exclusive mode before
> >   calling this function.
> 
> Sure, but I would make a couple of tweaks to the beginning:
> 
>  * ``IOMAP_ATOMIC``: This write is to be be issued with torn-write
>    protection. Only a single bio can be created for the write, and the
>    bio must not be split into multiple I/O requests, i.e. flag
>    REQ_ATOMIC must be set.
>    The file range to write must be aligned to satisfy the requirements
>    of both the filesystem and the underlying block device's atomic
>    commit capabilities.
>    If filesystem metadata updates are required (e.g. unwritten extent
>    conversion or copy on write), all updates for the entire file range
>    must be committed atomically as well.
> 
> ok?

Yep, sounds good.

--D

> Thanks,
> John
> 
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 6/7] xfs: Validate atomic writes
  2024-10-01 13:22     ` John Garry
@ 2024-10-01 14:48       ` Darrick J. Wong
  2024-10-01 15:48         ` John Garry
  0 siblings, 1 reply; 29+ messages in thread
From: Darrick J. Wong @ 2024-10-01 14:48 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, brauner, viro, jack, dchinner, hch, cem, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, hare, martin.petersen,
	catherine.hoang, mcgrof, ritesh.list, ojaswin

On Tue, Oct 01, 2024 at 02:22:23PM +0100, John Garry wrote:
> On 30/09/2024 17:41, Darrick J. Wong wrote:
> > On Mon, Sep 30, 2024 at 12:54:37PM +0000, John Garry wrote:
> > > Validate that an atomic write adheres to length/offset rules. Currently
> > > we can only write a single FS block.
> > > 
> > > For an IOCB with IOCB_ATOMIC set to get as far as xfs_file_dio_write(),
> > > FMODE_CAN_ATOMIC_WRITE will need to be set for the file; for this,
> > > ATOMICWRITES flags would also need to be set for the inode.
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   fs/xfs/xfs_file.c | 7 +++++++
> > >   1 file changed, 7 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 412b1d71b52b..fa6a44b88ecc 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -688,6 +688,13 @@ xfs_file_dio_write(
> > >   	struct xfs_buftarg      *target = xfs_inode_buftarg(ip);
> > >   	size_t			count = iov_iter_count(from);
> > > +	if (iocb->ki_flags & IOCB_ATOMIC) {
> > > +		if (count != ip->i_mount->m_sb.sb_blocksize)
> > > +			return -EINVAL;
> > > +		if (!generic_atomic_write_valid(iocb, from))
> > > +			return -EINVAL;
> > > +	}
> > 
> > Does xfs_file_write_iter need a catch-all so that we don't fall back to
> > buffered write for a directio write that returns ENOTBLK?
> > 
> > 	if (iocb->ki_flags & IOCB_DIRECT) {
> > 		/*
> > 		 * Allow a directio write to fall back to a buffered
> > 		 * write *only* in the case that we're doing a reflink
> > 		 * CoW.  In all other directio scenarios we do not
> > 		 * allow an operation to fall back to buffered mode.
> > 		 */
> > 		ret = xfs_file_dio_write(iocb, from);
> > 		if (ret != -ENOTBLK || (iocb->ki_flags & IOCB_ATOMIC))
> > 			return ret;
> > 	}
> > 
> > IIRC iomap_dio_rw can return ENOTBLK if pagecache invalidation fails for
> > the region that we're trying to directio write.
> 
> I see where you are talking about. There is also a ENOTBLK from unaligned
> write for CoW, but we would not see that.
> 
> But I was thinking to use a common helper to catch this, like
> generic_write_checks_count() [which is called on the buffered IO path]:
> 
> ----8<-----
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 32b476bf9be0..222f25c6439c 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1774,6 +1774,10 @@ int generic_write_checks_count(struct kiocb *iocb,
> loff_t *count)
>  	if (!*count)
>  		return 0;
> 
> +	if (iocb->ki_flags & IOCB_ATOMIC &&
> +	    !(iocb->ki_flags & IOCB_DIRECT))
> +		return -EINVAL;
> +
>  	if (iocb->ki_flags & IOCB_APPEND)
>  		iocb->ki_pos = i_size_read(inode);
> 
> ---->8-----
> 
> But we keep the IOCB_DIRECT flag for the buffered IO fallback (so no good).
> 
> Another option would be:
> 
> ----8<-----
> 
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -679,7 +679,12 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter
> *iter,
>  			if (ret != -EAGAIN) {
>  				trace_iomap_dio_invalidate_fail(inode, iomi.pos,
>  								iomi.len);
> -				ret = -ENOTBLK;
> +				if (iocb->ki_flags & IOCB_ATOMIC) {
> +					if (ret == -ENOTBLK)
> +						ret = -EAGAIN;

I don't follow the logic here -- all the error codes except for EAGAIN
are squashed into ENOTBLK, so why would we let them through for an
atomic write?

	if (ret != -EAGAIN) {
		trace_iomap_dio_invalidate_fail(inode, iomi.pos,
						iomi.len);

		if (iocb->ki_flags & IOCB_ATOMIC) {
			/*
			 * folio invalidation failed, maybe this is
			 * transient, unlock and see if the caller
			 * tries again
			 */
			return -EAGAIN;
		} else {
			/* fall back to buffered write */
			return -ENOTBLK;
		}
	}

--D

> +				}else {
> +					ret = -ENOTBLK;
> +				}
>  			}
>  			goto out_free_dio;
>  		}
> ---->8-----
> 
> I suggest that, as other FSes (like ext4) handle -ENOTBLK and would need to
> be changed similar to XFS. But I am not sure if changing the error code from
> -ENOTBLK for IOCB_ATOMIC is ok.
> 
> Let me know what you think about possible alternative solutions.
> 
> Thanks,
> John
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 6/7] xfs: Validate atomic writes
  2024-10-01 14:48       ` Darrick J. Wong
@ 2024-10-01 15:48         ` John Garry
  0 siblings, 0 replies; 29+ messages in thread
From: John Garry @ 2024-10-01 15:48 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, brauner, viro, jack, dchinner, hch, cem, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, hare, martin.petersen,
	catherine.hoang, mcgrof, ritesh.list, ojaswin

On 01/10/2024 15:48, Darrick J. Wong wrote:
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -679,7 +679,12 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter
>> *iter,
>>   			if (ret != -EAGAIN) {
>>   				trace_iomap_dio_invalidate_fail(inode, iomi.pos,
>>   								iomi.len);
>> -				ret = -ENOTBLK;
>> +				if (iocb->ki_flags & IOCB_ATOMIC) {
>> +					if (ret == -ENOTBLK)
>> +						ret = -EAGAIN;
> I don't follow the logic here -- all the error codes except for EAGAIN
> are squashed into ENOTBLK, so why would we let them through for an
> atomic write?

Right, the errors apart from EAGAIN are normally squashed to ENOTBLK; 
but I thought that since we would not do this for IOCB_ATOMIC, then just 
return the actual error from kiocb_invalidate_pages() - apart from 
ENOTBLK, which has this special treatment.

But I can always just return EAGAIN for IOCB_ATOMIC when 
kiocb_invalidate_pages() errors, as you suggest below.

> 
> 	if (ret != -EAGAIN) {
> 		trace_iomap_dio_invalidate_fail(inode, iomi.pos,
> 						iomi.len);
> 
> 		if (iocb->ki_flags & IOCB_ATOMIC) {
> 			/*
> 			 * folio invalidation failed, maybe this is
> 			 * transient, unlock and see if the caller
> 			 * tries again
> 			 */
> 			return -EAGAIN;
> 		} else {
> 			/* fall back to buffered write */
> 			return -ENOTBLK;
> 		}
> 	}

Cheers,
John

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 4/7] xfs: Support FS_XFLAG_ATOMICWRITES
  2024-09-30 12:54 ` [PATCH v6 4/7] xfs: Support FS_XFLAG_ATOMICWRITES John Garry
  2024-09-30 16:03   ` Darrick J. Wong
@ 2024-10-03 12:48   ` John Garry
  2024-10-03 13:02     ` Christoph Hellwig
  1 sibling, 1 reply; 29+ messages in thread
From: John Garry @ 2024-10-03 12:48 UTC (permalink / raw)
  To: axboe, brauner, djwong, viro, jack, dchinner, hch, cem
  Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
	martin.petersen, catherine.hoang, mcgrof, ritesh.list, ojaswin

On 30/09/2024 13:54, John Garry wrote:
> @@ -352,11 +352,15 @@ xfs_sb_has_compat_feature(
>   #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
>   #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
>   #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
> +#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 31)	/* atomicwrites enabled */
> +

BTW, Darrick, as you questioned previously, this does make xfs/270 
fail... until the change to a not use the top bit.

Thanks,
John

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 4/7] xfs: Support FS_XFLAG_ATOMICWRITES
  2024-10-03 12:48   ` John Garry
@ 2024-10-03 13:02     ` Christoph Hellwig
  2024-10-03 13:19       ` John Garry
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2024-10-03 13:02 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, brauner, djwong, viro, jack, dchinner, hch, cem,
	linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
	martin.petersen, catherine.hoang, mcgrof, ritesh.list, ojaswin

On Thu, Oct 03, 2024 at 01:48:41PM +0100, John Garry wrote:
> On 30/09/2024 13:54, John Garry wrote:
>> @@ -352,11 +352,15 @@ xfs_sb_has_compat_feature(
>>   #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
>>   #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
>>   #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
>> +#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 31)	/* atomicwrites enabled */
>> +
>
> BTW, Darrick, as you questioned previously, this does make xfs/270 fail... 
> until the change to a not use the top bit.

With the large block size based atomic writes we shoudn't even need
a feature flag, or am I missing something?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 4/7] xfs: Support FS_XFLAG_ATOMICWRITES
  2024-10-03 13:02     ` Christoph Hellwig
@ 2024-10-03 13:19       ` John Garry
  0 siblings, 0 replies; 29+ messages in thread
From: John Garry @ 2024-10-03 13:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, brauner, djwong, viro, jack, dchinner, cem, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, hare, martin.petersen,
	catherine.hoang, mcgrof, ritesh.list, ojaswin

On 03/10/2024 14:02, Christoph Hellwig wrote:
> On Thu, Oct 03, 2024 at 01:48:41PM +0100, John Garry wrote:
>> On 30/09/2024 13:54, John Garry wrote:
>>> @@ -352,11 +352,15 @@ xfs_sb_has_compat_feature(
>>>    #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
>>>    #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
>>>    #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
>>> +#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 31)	/* atomicwrites enabled */
>>> +
>> BTW, Darrick, as you questioned previously, this does make xfs/270 fail...
>> until the change to a not use the top bit.
> With the large block size based atomic writes we shoudn't even need
> a feature flag, or am I missing something?

It really does not add much ATM.

It originated back with you mentioned that we need a generic common way 
to enable atomic writes.

And then we had forcealign, which was tightly-coupled to enabling atomic 
writes, e.g. enforce forcealign enabled and power-of-2 extsize.

But, apart from leveraging larger FS block size for atomic writes, I 
still think that we will want forcealign or similar.

I don't have full tests results yet, but we already know that we see 
performance degradation in some scenarios when going to a 4K -> 16K FS 
block size. We're testing MySQL, and Redo Log performance/efficiency 
significantly degrades with 16K FS block size.

Thanks,
John

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2024-10-03 13:19 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-30 12:54 [PATCH v6 0/7] block atomic writes for xfs John Garry
2024-09-30 12:54 ` [PATCH v6 1/7] block/fs: Pass an iocb to generic_atomic_write_valid() John Garry
2024-10-01  8:39   ` Christoph Hellwig
2024-09-30 12:54 ` [PATCH v6 2/7] fs: Export generic_atomic_write_valid() John Garry
2024-10-01  8:41   ` Christoph Hellwig
2024-09-30 12:54 ` [PATCH v6 3/7] fs: iomap: Atomic write support John Garry
2024-09-30 15:55   ` Darrick J. Wong
2024-10-01  8:05     ` John Garry
2024-10-01 14:37       ` Darrick J. Wong
2024-09-30 12:54 ` [PATCH v6 4/7] xfs: Support FS_XFLAG_ATOMICWRITES John Garry
2024-09-30 16:03   ` Darrick J. Wong
2024-09-30 16:44     ` Darrick J. Wong
2024-10-01  8:41     ` Christoph Hellwig
2024-10-01 12:05       ` John Garry
2024-10-01 13:35     ` John Garry
2024-10-03 12:48   ` John Garry
2024-10-03 13:02     ` Christoph Hellwig
2024-10-03 13:19       ` John Garry
2024-09-30 12:54 ` [PATCH v6 5/7] xfs: Support atomic write for statx John Garry
2024-09-30 16:37   ` Darrick J. Wong
2024-10-01  8:29     ` John Garry
2024-10-01  8:43     ` Christoph Hellwig
2024-09-30 12:54 ` [PATCH v6 6/7] xfs: Validate atomic writes John Garry
2024-09-30 16:41   ` Darrick J. Wong
2024-10-01 13:22     ` John Garry
2024-10-01 14:48       ` Darrick J. Wong
2024-10-01 15:48         ` John Garry
2024-09-30 12:54 ` [PATCH v6 7/7] xfs: Support setting FMODE_CAN_ATOMIC_WRITE John Garry
2024-09-30 16:41   ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).