* [PATCH v5 1/7] block/fs: Pass an iocb to generic_atomic_write_valid()
2024-08-17 9:47 [PATCH v5 0/7] block atomic writes for xfs John Garry
@ 2024-08-17 9:47 ` John Garry
2024-08-20 17:28 ` Darrick J. Wong
2024-08-17 9:47 ` [PATCH v5 2/7] fs: Export generic_atomic_write_valid() John Garry
` (5 subsequent siblings)
6 siblings, 1 reply; 28+ messages in thread
From: John Garry @ 2024-08-17 9:47 UTC (permalink / raw)
To: axboe, brauner, djwong, viro, jack, chandan.babu, dchinner, hch
Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch, John Garry
Darrick and Hannes thought it better that generic_atomic_write_valid()
should be passed a struct iocb, and not just the member of that struct
which is referenced; see [0] and [1].
I think that makes a more generic and clean API, so make that change.
[0] https://lore.kernel.org/linux-block/680ce641-729b-4150-b875-531a98657682@suse.de/
[1] https://lore.kernel.org/linux-xfs/20240620212401.GA3058325@frogsfrogsfrogs/
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Suggested-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
block/fops.c | 8 ++++----
fs/read_write.c | 4 ++--
include/linux/fs.h | 2 +-
3 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/block/fops.c b/block/fops.c
index 9825c1713a49..1c0f9d313845 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -34,13 +34,13 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
return opf;
}
-static bool blkdev_dio_invalid(struct block_device *bdev, loff_t pos,
+static bool blkdev_dio_invalid(struct block_device *bdev, struct kiocb *iocb,
struct iov_iter *iter, bool is_atomic)
{
- if (is_atomic && !generic_atomic_write_valid(iter, pos))
+ if (is_atomic && !generic_atomic_write_valid(iocb, iter))
return true;
- return pos & (bdev_logical_block_size(bdev) - 1) ||
+ return iocb->ki_pos & (bdev_logical_block_size(bdev) - 1) ||
!bdev_iter_is_aligned(bdev, iter);
}
@@ -373,7 +373,7 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
if (!iov_iter_count(iter))
return 0;
- if (blkdev_dio_invalid(bdev, iocb->ki_pos, iter, is_atomic))
+ if (blkdev_dio_invalid(bdev, iocb, iter, is_atomic))
return -EINVAL;
nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS + 1);
diff --git a/fs/read_write.c b/fs/read_write.c
index 90e283b31ca1..d8af6f2f1c9a 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1737,7 +1737,7 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
return 0;
}
-bool generic_atomic_write_valid(struct iov_iter *iter, loff_t pos)
+bool generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter)
{
size_t len = iov_iter_count(iter);
@@ -1747,7 +1747,7 @@ bool generic_atomic_write_valid(struct iov_iter *iter, loff_t pos)
if (!is_power_of_2(len))
return false;
- if (!IS_ALIGNED(pos, len))
+ if (!IS_ALIGNED(iocb->ki_pos, len))
return false;
return true;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd34b5755c0b..55d8b6beac7a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3657,6 +3657,6 @@ static inline bool vfs_empty_path(int dfd, const char __user *path)
return !c;
}
-bool generic_atomic_write_valid(struct iov_iter *iter, loff_t pos);
+bool generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter);
#endif /* _LINUX_FS_H */
--
2.31.1
^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH v5 1/7] block/fs: Pass an iocb to generic_atomic_write_valid()
2024-08-17 9:47 ` [PATCH v5 1/7] block/fs: Pass an iocb to generic_atomic_write_valid() John Garry
@ 2024-08-20 17:28 ` Darrick J. Wong
0 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2024-08-20 17:28 UTC (permalink / raw)
To: John Garry
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
On Sat, Aug 17, 2024 at 09:47:54AM +0000, John Garry wrote:
> Darrick and Hannes thought it better that generic_atomic_write_valid()
> should be passed a struct iocb, and not just the member of that struct
> which is referenced; see [0] and [1].
>
> I think that makes a more generic and clean API, so make that change.
>
> [0] https://lore.kernel.org/linux-block/680ce641-729b-4150-b875-531a98657682@suse.de/
> [1] https://lore.kernel.org/linux-xfs/20240620212401.GA3058325@frogsfrogsfrogs/
>
> Suggested-by: Darrick J. Wong <djwong@kernel.org>
> Suggested-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
Thanks for doing this,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
> block/fops.c | 8 ++++----
> fs/read_write.c | 4 ++--
> include/linux/fs.h | 2 +-
> 3 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/block/fops.c b/block/fops.c
> index 9825c1713a49..1c0f9d313845 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -34,13 +34,13 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
> return opf;
> }
>
> -static bool blkdev_dio_invalid(struct block_device *bdev, loff_t pos,
> +static bool blkdev_dio_invalid(struct block_device *bdev, struct kiocb *iocb,
> struct iov_iter *iter, bool is_atomic)
> {
> - if (is_atomic && !generic_atomic_write_valid(iter, pos))
> + if (is_atomic && !generic_atomic_write_valid(iocb, iter))
> return true;
>
> - return pos & (bdev_logical_block_size(bdev) - 1) ||
> + return iocb->ki_pos & (bdev_logical_block_size(bdev) - 1) ||
> !bdev_iter_is_aligned(bdev, iter);
> }
>
> @@ -373,7 +373,7 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> if (!iov_iter_count(iter))
> return 0;
>
> - if (blkdev_dio_invalid(bdev, iocb->ki_pos, iter, is_atomic))
> + if (blkdev_dio_invalid(bdev, iocb, iter, is_atomic))
> return -EINVAL;
>
> nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS + 1);
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 90e283b31ca1..d8af6f2f1c9a 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1737,7 +1737,7 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
> return 0;
> }
>
> -bool generic_atomic_write_valid(struct iov_iter *iter, loff_t pos)
> +bool generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter)
> {
> size_t len = iov_iter_count(iter);
>
> @@ -1747,7 +1747,7 @@ bool generic_atomic_write_valid(struct iov_iter *iter, loff_t pos)
> if (!is_power_of_2(len))
> return false;
>
> - if (!IS_ALIGNED(pos, len))
> + if (!IS_ALIGNED(iocb->ki_pos, len))
> return false;
>
> return true;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index fd34b5755c0b..55d8b6beac7a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3657,6 +3657,6 @@ static inline bool vfs_empty_path(int dfd, const char __user *path)
> return !c;
> }
>
> -bool generic_atomic_write_valid(struct iov_iter *iter, loff_t pos);
> +bool generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter);
>
> #endif /* _LINUX_FS_H */
> --
> 2.31.1
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v5 2/7] fs: Export generic_atomic_write_valid()
2024-08-17 9:47 [PATCH v5 0/7] block atomic writes for xfs John Garry
2024-08-17 9:47 ` [PATCH v5 1/7] block/fs: Pass an iocb to generic_atomic_write_valid() John Garry
@ 2024-08-17 9:47 ` John Garry
2024-08-20 17:28 ` Darrick J. Wong
2024-08-17 9:47 ` [PATCH v5 3/7] fs: iomap: Atomic write support John Garry
` (4 subsequent siblings)
6 siblings, 1 reply; 28+ messages in thread
From: John Garry @ 2024-08-17 9:47 UTC (permalink / raw)
To: axboe, brauner, djwong, viro, jack, chandan.babu, dchinner, hch
Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch, John Garry
The XFS code will need this.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/read_write.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/read_write.c b/fs/read_write.c
index d8af6f2f1c9a..babc3673c22c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1752,3 +1752,4 @@ bool generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter)
return true;
}
+EXPORT_SYMBOL_GPL(generic_atomic_write_valid);
--
2.31.1
^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH v5 2/7] fs: Export generic_atomic_write_valid()
2024-08-17 9:47 ` [PATCH v5 2/7] fs: Export generic_atomic_write_valid() John Garry
@ 2024-08-20 17:28 ` Darrick J. Wong
0 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2024-08-20 17:28 UTC (permalink / raw)
To: John Garry
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
On Sat, Aug 17, 2024 at 09:47:55AM +0000, John Garry wrote:
> The XFS code will need this.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> fs/read_write.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index d8af6f2f1c9a..babc3673c22c 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1752,3 +1752,4 @@ bool generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter)
>
> return true;
> }
> +EXPORT_SYMBOL_GPL(generic_atomic_write_valid);
Looks great,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> --
> 2.31.1
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v5 3/7] fs: iomap: Atomic write support
2024-08-17 9:47 [PATCH v5 0/7] block atomic writes for xfs John Garry
2024-08-17 9:47 ` [PATCH v5 1/7] block/fs: Pass an iocb to generic_atomic_write_valid() John Garry
2024-08-17 9:47 ` [PATCH v5 2/7] fs: Export generic_atomic_write_valid() John Garry
@ 2024-08-17 9:47 ` John Garry
2024-08-21 16:58 ` Darrick J. Wong
2024-08-17 9:47 ` [PATCH v5 4/7] xfs: Support FS_XFLAG_ATOMICWRITES for forcealign John Garry
` (3 subsequent siblings)
6 siblings, 1 reply; 28+ messages in thread
From: John Garry @ 2024-08-17 9:47 UTC (permalink / raw)
To: axboe, brauner, djwong, viro, jack, chandan.babu, dchinner, hch
Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch, John Garry
Support direct I/O atomic writes by producing a single bio with REQ_ATOMIC
flag set.
We rely on the FS to guarantee extent alignment, such that an atomic write
should never straddle two or more extents. The FS should also check for
validity of an atomic write length/alignment.
For each iter, data is appended to the single bio. That bio is allocated
on the first iter.
If that total bio data does not match the expected total, then error and
do not submit the bio as we cannot tolerate a partial write.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/iomap/direct-io.c | 122 +++++++++++++++++++++++++++++++++++-------
fs/iomap/trace.h | 3 +-
include/linux/iomap.h | 1 +
3 files changed, 106 insertions(+), 20 deletions(-)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..72f28d53ab03 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -37,6 +37,7 @@ struct iomap_dio {
int error;
size_t done_before;
bool wait_for_completion;
+ struct bio *atomic_bio;
union {
/* used during submission and for synchronous completion: */
@@ -61,6 +62,24 @@ static struct bio *iomap_dio_alloc_bio(const struct iomap_iter *iter,
return bio_alloc(iter->iomap.bdev, nr_vecs, opf, GFP_KERNEL);
}
+static struct bio *iomap_dio_alloc_bio_data(const struct iomap_iter *iter,
+ struct iomap_dio *dio, unsigned short nr_vecs, blk_opf_t opf,
+ loff_t pos)
+{
+ struct bio *bio = iomap_dio_alloc_bio(iter, dio, nr_vecs, opf);
+ struct inode *inode = iter->inode;
+
+ fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
+ GFP_KERNEL);
+ bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
+ bio->bi_write_hint = inode->i_write_hint;
+ bio->bi_ioprio = dio->iocb->ki_ioprio;
+ bio->bi_private = dio;
+ bio->bi_end_io = iomap_dio_bio_end_io;
+
+ return bio;
+}
+
static void iomap_dio_submit_bio(const struct iomap_iter *iter,
struct iomap_dio *dio, struct bio *bio, loff_t pos)
{
@@ -256,7 +275,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
* clearing the WRITE_THROUGH flag in the dio request.
*/
static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
- const struct iomap *iomap, bool use_fua)
+ const struct iomap *iomap, bool use_fua, bool atomic)
{
blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
@@ -268,6 +287,8 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
opflags |= REQ_FUA;
else
dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
+ if (atomic)
+ opflags |= REQ_ATOMIC;
return opflags;
}
@@ -275,21 +296,23 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
struct iomap_dio *dio)
{
+ bool atomic = dio->iocb->ki_flags & IOCB_ATOMIC;
const struct iomap *iomap = &iter->iomap;
struct inode *inode = iter->inode;
unsigned int fs_block_size = i_blocksize(inode), pad;
+ struct iov_iter *i = dio->submit.iter;
loff_t length = iomap_length(iter);
loff_t pos = iter->pos;
blk_opf_t bio_opf;
struct bio *bio;
bool need_zeroout = false;
bool use_fua = false;
- int nr_pages, ret = 0;
+ int nr_pages, orig_nr_pages, ret = 0;
size_t copied = 0;
size_t orig_count;
if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
- !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
+ !bdev_iter_is_aligned(iomap->bdev, i))
return -EINVAL;
if (iomap->type == IOMAP_UNWRITTEN) {
@@ -322,15 +345,35 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
dio->flags &= ~IOMAP_DIO_CALLER_COMP;
}
+ if (dio->atomic_bio) {
+ /*
+ * These should not fail, but check just in case.
+ * Caller takes care of freeing the bio.
+ */
+ if (iter->iomap.bdev != dio->atomic_bio->bi_bdev) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (dio->atomic_bio->bi_iter.bi_sector +
+ (dio->atomic_bio->bi_iter.bi_size >> SECTOR_SHIFT) !=
+ iomap_sector(iomap, pos)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ } else if (atomic) {
+ orig_nr_pages = bio_iov_vecs_to_alloc(i, BIO_MAX_VECS);
+ }
+
/*
* Save the original count and trim the iter to just the extent we
* are operating on right now. The iter will be re-expanded once
* we are done.
*/
- orig_count = iov_iter_count(dio->submit.iter);
- iov_iter_truncate(dio->submit.iter, length);
+ orig_count = iov_iter_count(i);
+ iov_iter_truncate(i, length);
- if (!iov_iter_count(dio->submit.iter))
+ if (!iov_iter_count(i))
goto out;
/*
@@ -365,27 +408,46 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
* can set up the page vector appropriately for a ZONE_APPEND
* operation.
*/
- bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
+ bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic);
+
+ if (atomic) {
+ size_t orig_atomic_size;
+
+ if (!dio->atomic_bio) {
+ dio->atomic_bio = iomap_dio_alloc_bio_data(iter,
+ dio, orig_nr_pages, bio_opf, pos);
+ }
+ orig_atomic_size = dio->atomic_bio->bi_iter.bi_size;
+
+ /*
+ * In case of error, caller takes care of freeing the bio. The
+ * smallest size of atomic write is i_node size, so no need for
+ * tail zeroing out.
+ */
+ ret = bio_iov_iter_get_pages(dio->atomic_bio, i);
+ if (!ret) {
+ copied = dio->atomic_bio->bi_iter.bi_size -
+ orig_atomic_size;
+ }
- nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
+ dio->size += copied;
+ goto out;
+ }
+
+ nr_pages = bio_iov_vecs_to_alloc(i, BIO_MAX_VECS);
do {
size_t n;
if (dio->error) {
- iov_iter_revert(dio->submit.iter, copied);
+ iov_iter_revert(i, copied);
copied = ret = 0;
goto out;
}
- bio = iomap_dio_alloc_bio(iter, dio, nr_pages, bio_opf);
+ bio = iomap_dio_alloc_bio_data(iter, dio, nr_pages, bio_opf, pos);
fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
GFP_KERNEL);
- bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
- bio->bi_write_hint = inode->i_write_hint;
- bio->bi_ioprio = dio->iocb->ki_ioprio;
- bio->bi_private = dio;
- bio->bi_end_io = iomap_dio_bio_end_io;
- ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
+ ret = bio_iov_iter_get_pages(bio, i);
if (unlikely(ret)) {
/*
* We have to stop part way through an IO. We must fall
@@ -408,8 +470,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
dio->size += n;
copied += n;
- nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter,
- BIO_MAX_VECS);
+ nr_pages = bio_iov_vecs_to_alloc(i, BIO_MAX_VECS);
/*
* We can only poll for single bio I/Os.
*/
@@ -435,7 +496,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
}
out:
/* Undo iter limitation to current extent */
- iov_iter_reexpand(dio->submit.iter, orig_count - copied);
+ iov_iter_reexpand(i, orig_count - copied);
if (copied)
return copied;
return ret;
@@ -555,6 +616,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
struct blk_plug plug;
struct iomap_dio *dio;
loff_t ret = 0;
+ size_t orig_count = iov_iter_count(iter);
trace_iomap_dio_rw_begin(iocb, iter, dio_flags, done_before);
@@ -580,6 +642,13 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
if (iocb->ki_flags & IOCB_NOWAIT)
iomi.flags |= IOMAP_NOWAIT;
+ if (iocb->ki_flags & IOCB_ATOMIC) {
+ if (bio_iov_vecs_to_alloc(iter, INT_MAX) > BIO_MAX_VECS)
+ return ERR_PTR(-EINVAL);
+ iomi.flags |= IOMAP_ATOMIC;
+ }
+ dio->atomic_bio = NULL;
+
if (iov_iter_rw(iter) == READ) {
/* reads can always complete inline */
dio->flags |= IOMAP_DIO_INLINE_COMP;
@@ -665,6 +734,21 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
iocb->ki_flags &= ~IOCB_HIPRI;
}
+ if (iocb->ki_flags & IOCB_ATOMIC) {
+ if (ret >= 0) {
+ if (dio->size == orig_count) {
+ iomap_dio_submit_bio(&iomi, dio,
+ dio->atomic_bio, iocb->ki_pos);
+ } else {
+ if (dio->atomic_bio)
+ bio_put(dio->atomic_bio);
+ ret = -EINVAL;
+ }
+ } else if (dio->atomic_bio) {
+ bio_put(dio->atomic_bio);
+ }
+ }
+
blk_finish_plug(&plug);
/*
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index 0a991c4ce87d..4118a42cdab0 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -98,7 +98,8 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
{ IOMAP_REPORT, "REPORT" }, \
{ IOMAP_FAULT, "FAULT" }, \
{ IOMAP_DIRECT, "DIRECT" }, \
- { IOMAP_NOWAIT, "NOWAIT" }
+ { IOMAP_NOWAIT, "NOWAIT" }, \
+ { IOMAP_ATOMIC, "ATOMIC" }
#define IOMAP_F_FLAGS_STRINGS \
{ IOMAP_F_NEW, "NEW" }, \
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 6fc1c858013d..8fd949442262 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -178,6 +178,7 @@ struct iomap_folio_ops {
#else
#define IOMAP_DAX 0
#endif /* CONFIG_FS_DAX */
+#define IOMAP_ATOMIC (1 << 9)
struct iomap_ops {
/*
--
2.31.1
^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH v5 3/7] fs: iomap: Atomic write support
2024-08-17 9:47 ` [PATCH v5 3/7] fs: iomap: Atomic write support John Garry
@ 2024-08-21 16:58 ` Darrick J. Wong
2024-08-22 15:29 ` John Garry
0 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2024-08-21 16:58 UTC (permalink / raw)
To: John Garry
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
On Sat, Aug 17, 2024 at 09:47:56AM +0000, John Garry wrote:
> Support direct I/O atomic writes by producing a single bio with REQ_ATOMIC
> flag set.
>
> We rely on the FS to guarantee extent alignment, such that an atomic write
> should never straddle two or more extents. The FS should also check for
> validity of an atomic write length/alignment.
>
> For each iter, data is appended to the single bio. That bio is allocated
> on the first iter.
>
> If that total bio data does not match the expected total, then error and
> do not submit the bio as we cannot tolerate a partial write.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> fs/iomap/direct-io.c | 122 +++++++++++++++++++++++++++++++++++-------
> fs/iomap/trace.h | 3 +-
> include/linux/iomap.h | 1 +
> 3 files changed, 106 insertions(+), 20 deletions(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f3b43d223a46..72f28d53ab03 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -37,6 +37,7 @@ struct iomap_dio {
> int error;
> size_t done_before;
> bool wait_for_completion;
> + struct bio *atomic_bio;
>
> union {
> /* used during submission and for synchronous completion: */
> @@ -61,6 +62,24 @@ static struct bio *iomap_dio_alloc_bio(const struct iomap_iter *iter,
> return bio_alloc(iter->iomap.bdev, nr_vecs, opf, GFP_KERNEL);
> }
>
> +static struct bio *iomap_dio_alloc_bio_data(const struct iomap_iter *iter,
> + struct iomap_dio *dio, unsigned short nr_vecs, blk_opf_t opf,
> + loff_t pos)
> +{
> + struct bio *bio = iomap_dio_alloc_bio(iter, dio, nr_vecs, opf);
> + struct inode *inode = iter->inode;
> +
> + fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> + GFP_KERNEL);
> + bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
> + bio->bi_write_hint = inode->i_write_hint;
> + bio->bi_ioprio = dio->iocb->ki_ioprio;
> + bio->bi_private = dio;
> + bio->bi_end_io = iomap_dio_bio_end_io;
> +
> + return bio;
> +}
> +
> static void iomap_dio_submit_bio(const struct iomap_iter *iter,
> struct iomap_dio *dio, struct bio *bio, loff_t pos)
> {
> @@ -256,7 +275,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> * clearing the WRITE_THROUGH flag in the dio request.
> */
> static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
> - const struct iomap *iomap, bool use_fua)
> + const struct iomap *iomap, bool use_fua, bool atomic)
> {
> blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
>
> @@ -268,6 +287,8 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
> opflags |= REQ_FUA;
> else
> dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
> + if (atomic)
> + opflags |= REQ_ATOMIC;
>
> return opflags;
> }
> @@ -275,21 +296,23 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
> static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> struct iomap_dio *dio)
> {
> + bool atomic = dio->iocb->ki_flags & IOCB_ATOMIC;
> const struct iomap *iomap = &iter->iomap;
> struct inode *inode = iter->inode;
> unsigned int fs_block_size = i_blocksize(inode), pad;
> + struct iov_iter *i = dio->submit.iter;
If you're going to pull this out into a convenience variable, please do
that as a separate patch so that the actual untorn write additions don't
get mixed in.
> loff_t length = iomap_length(iter);
> loff_t pos = iter->pos;
> blk_opf_t bio_opf;
> struct bio *bio;
> bool need_zeroout = false;
> bool use_fua = false;
> - int nr_pages, ret = 0;
> + int nr_pages, orig_nr_pages, ret = 0;
> size_t copied = 0;
> size_t orig_count;
>
> if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> - !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> + !bdev_iter_is_aligned(iomap->bdev, i))
> return -EINVAL;
>
> if (iomap->type == IOMAP_UNWRITTEN) {
> @@ -322,15 +345,35 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> dio->flags &= ~IOMAP_DIO_CALLER_COMP;
> }
>
> + if (dio->atomic_bio) {
> + /*
> + * These should not fail, but check just in case.
> + * Caller takes care of freeing the bio.
> + */
> + if (iter->iomap.bdev != dio->atomic_bio->bi_bdev) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + if (dio->atomic_bio->bi_iter.bi_sector +
> + (dio->atomic_bio->bi_iter.bi_size >> SECTOR_SHIFT) !=
Hmm, so I guess you stash an untorn write bio in the iomap_dio so that
multiple iomap_dio_bio_iter can try to combine a mixed mapping into a
single contiguous untorn write that can be completed all at once?
I suppose that works as long as the iomap->type is the same across all
the _iter calls, but I think that needs explicit checking here.
> + iomap_sector(iomap, pos)) {
> + ret = -EINVAL;
> + goto out;
> + }
> + } else if (atomic) {
> + orig_nr_pages = bio_iov_vecs_to_alloc(i, BIO_MAX_VECS);
> + }
> +
> /*
> * Save the original count and trim the iter to just the extent we
> * are operating on right now. The iter will be re-expanded once
> * we are done.
> */
> - orig_count = iov_iter_count(dio->submit.iter);
> - iov_iter_truncate(dio->submit.iter, length);
> + orig_count = iov_iter_count(i);
> + iov_iter_truncate(i, length);
>
> - if (!iov_iter_count(dio->submit.iter))
> + if (!iov_iter_count(i))
> goto out;
>
> /*
> @@ -365,27 +408,46 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> * can set up the page vector appropriately for a ZONE_APPEND
> * operation.
> */
> - bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
> + bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic);
> +
> + if (atomic) {
> + size_t orig_atomic_size;
> +
> + if (!dio->atomic_bio) {
> + dio->atomic_bio = iomap_dio_alloc_bio_data(iter,
> + dio, orig_nr_pages, bio_opf, pos);
> + }
> + orig_atomic_size = dio->atomic_bio->bi_iter.bi_size;
> +
> + /*
> + * In case of error, caller takes care of freeing the bio. The
> + * smallest size of atomic write is i_node size, so no need for
What is "i_node size"? Are you referring to i_blocksize?
> + * tail zeroing out.
> + */
> + ret = bio_iov_iter_get_pages(dio->atomic_bio, i);
> + if (!ret) {
> + copied = dio->atomic_bio->bi_iter.bi_size -
> + orig_atomic_size;
> + }
>
> - nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
> + dio->size += copied;
> + goto out;
> + }
> +
> + nr_pages = bio_iov_vecs_to_alloc(i, BIO_MAX_VECS);
> do {
> size_t n;
> if (dio->error) {
> - iov_iter_revert(dio->submit.iter, copied);
> + iov_iter_revert(i, copied);
> copied = ret = 0;
> goto out;
> }
>
> - bio = iomap_dio_alloc_bio(iter, dio, nr_pages, bio_opf);
> + bio = iomap_dio_alloc_bio_data(iter, dio, nr_pages, bio_opf, pos);
> fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> GFP_KERNEL);
> - bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
> - bio->bi_write_hint = inode->i_write_hint;
> - bio->bi_ioprio = dio->iocb->ki_ioprio;
> - bio->bi_private = dio;
> - bio->bi_end_io = iomap_dio_bio_end_io;
I see two places (here and iomap_dio_zero) that allocate a bio and
perform some initialization of it. Can you move the common pieces to
iomap_dio_alloc_bio instead of adding a iomap_dio_alloc_bio_data
variant, and move all that to a separate cleanup patch?
> - ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
> + ret = bio_iov_iter_get_pages(bio, i);
> if (unlikely(ret)) {
> /*
> * We have to stop part way through an IO. We must fall
> @@ -408,8 +470,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> dio->size += n;
> copied += n;
>
> - nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter,
> - BIO_MAX_VECS);
> + nr_pages = bio_iov_vecs_to_alloc(i, BIO_MAX_VECS);
> /*
> * We can only poll for single bio I/Os.
> */
> @@ -435,7 +496,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> }
> out:
> /* Undo iter limitation to current extent */
> - iov_iter_reexpand(dio->submit.iter, orig_count - copied);
> + iov_iter_reexpand(i, orig_count - copied);
> if (copied)
> return copied;
> return ret;
> @@ -555,6 +616,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> struct blk_plug plug;
> struct iomap_dio *dio;
> loff_t ret = 0;
> + size_t orig_count = iov_iter_count(iter);
>
> trace_iomap_dio_rw_begin(iocb, iter, dio_flags, done_before);
>
> @@ -580,6 +642,13 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> if (iocb->ki_flags & IOCB_NOWAIT)
> iomi.flags |= IOMAP_NOWAIT;
>
> + if (iocb->ki_flags & IOCB_ATOMIC) {
> + if (bio_iov_vecs_to_alloc(iter, INT_MAX) > BIO_MAX_VECS)
> + return ERR_PTR(-EINVAL);
> + iomi.flags |= IOMAP_ATOMIC;
> + }
> + dio->atomic_bio = NULL;
> +
> if (iov_iter_rw(iter) == READ) {
> /* reads can always complete inline */
> dio->flags |= IOMAP_DIO_INLINE_COMP;
> @@ -665,6 +734,21 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> iocb->ki_flags &= ~IOCB_HIPRI;
> }
>
> + if (iocb->ki_flags & IOCB_ATOMIC) {
> + if (ret >= 0) {
> + if (dio->size == orig_count) {
> + iomap_dio_submit_bio(&iomi, dio,
> + dio->atomic_bio, iocb->ki_pos);
Does this need to do task_io_account_write like regular direct writes
do?
> + } else {
> + if (dio->atomic_bio)
> + bio_put(dio->atomic_bio);
> + ret = -EINVAL;
> + }
> + } else if (dio->atomic_bio) {
> + bio_put(dio->atomic_bio);
This ought to null out dio->atomic_bio to prevent accidental UAF.
--D
> + }
> + }
> +
> blk_finish_plug(&plug);
>
> /*
> diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
> index 0a991c4ce87d..4118a42cdab0 100644
> --- a/fs/iomap/trace.h
> +++ b/fs/iomap/trace.h
> @@ -98,7 +98,8 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
> { IOMAP_REPORT, "REPORT" }, \
> { IOMAP_FAULT, "FAULT" }, \
> { IOMAP_DIRECT, "DIRECT" }, \
> - { IOMAP_NOWAIT, "NOWAIT" }
> + { IOMAP_NOWAIT, "NOWAIT" }, \
> + { IOMAP_ATOMIC, "ATOMIC" }
>
> #define IOMAP_F_FLAGS_STRINGS \
> { IOMAP_F_NEW, "NEW" }, \
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 6fc1c858013d..8fd949442262 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -178,6 +178,7 @@ struct iomap_folio_ops {
> #else
> #define IOMAP_DAX 0
> #endif /* CONFIG_FS_DAX */
> +#define IOMAP_ATOMIC (1 << 9)
>
> struct iomap_ops {
> /*
> --
> 2.31.1
>
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v5 3/7] fs: iomap: Atomic write support
2024-08-21 16:58 ` Darrick J. Wong
@ 2024-08-22 15:29 ` John Garry
2024-08-22 20:30 ` Darrick J. Wong
0 siblings, 1 reply; 28+ messages in thread
From: John Garry @ 2024-08-22 15:29 UTC (permalink / raw)
To: Darrick J. Wong
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
>> +
>> static void iomap_dio_submit_bio(const struct iomap_iter *iter,
>> struct iomap_dio *dio, struct bio *bio, loff_t pos)
>> {
>> @@ -256,7 +275,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>> * clearing the WRITE_THROUGH flag in the dio request.
>> */
>> static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
>> - const struct iomap *iomap, bool use_fua)
>> + const struct iomap *iomap, bool use_fua, bool atomic)
>> {
>> blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
>>
>> @@ -268,6 +287,8 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
>> opflags |= REQ_FUA;
>> else
>> dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
>> + if (atomic)
>> + opflags |= REQ_ATOMIC;
>>
>> return opflags;
>> }
>> @@ -275,21 +296,23 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
>> static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> struct iomap_dio *dio)
>> {
>> + bool atomic = dio->iocb->ki_flags & IOCB_ATOMIC;
>> const struct iomap *iomap = &iter->iomap;
>> struct inode *inode = iter->inode;
>> unsigned int fs_block_size = i_blocksize(inode), pad;
>> + struct iov_iter *i = dio->submit.iter;
>
> If you're going to pull this out into a convenience variable, please do
> that as a separate patch so that the actual untorn write additions don't
> get mixed in.
Yeah, I was thinking of doing that, so ok.
>
>> loff_t length = iomap_length(iter);
>> loff_t pos = iter->pos;
>> blk_opf_t bio_opf;
>> struct bio *bio;
>> bool need_zeroout = false;
>> bool use_fua = false;
>> - int nr_pages, ret = 0;
>> + int nr_pages, orig_nr_pages, ret = 0;
>> size_t copied = 0;
>> size_t orig_count;
>>
>> if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>> - !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>> + !bdev_iter_is_aligned(iomap->bdev, i))
>> return -EINVAL;
>>
>> if (iomap->type == IOMAP_UNWRITTEN) {
>> @@ -322,15 +345,35 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> dio->flags &= ~IOMAP_DIO_CALLER_COMP;
>> }
>>
>> + if (dio->atomic_bio) {
>> + /*
>> + * These should not fail, but check just in case.
>> + * Caller takes care of freeing the bio.
>> + */
>> + if (iter->iomap.bdev != dio->atomic_bio->bi_bdev) {
>> + ret = -EINVAL;
>> + goto out;
>> + }
>> +
>> + if (dio->atomic_bio->bi_iter.bi_sector +
>> + (dio->atomic_bio->bi_iter.bi_size >> SECTOR_SHIFT) !=
>
> Hmm, so I guess you stash an untorn write bio in the iomap_dio so that
> multiple iomap_dio_bio_iter can try to combine a mixed mapping into a
> single contiguous untorn write that can be completed all at once?
Right, we are writing to a contiguous LBA address range with a single
bio that happens to cover many different extents.
> I suppose that works as long as the iomap->type is the same across all
> the _iter calls, but I think that needs explicit checking here.
As an example, if we try to atomically write over the data in the
following file:
# xfs_bmap -vvp mnt/file
mnt/file:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..127]: hole 128
1: [128..135]: 256..263 0 (256..263) 8 010000
2: [136..143]: 264..271 0 (264..271) 8 000000
3: [144..255]: 272..383 0 (272..383) 112 010000
FLAG Values:
0100000 Shared extent
0010000 Unwritten preallocated extent
0001000 Doesn't begin on stripe unit
0000100 Doesn't end on stripe unit
0000010 Doesn't begin on stripe width
0000001 Doesn't end on stripe width
#
Then, the iomap->type/flag is either IOMAP_UNWRITTEN/IOMAP_F_DIRTY or
IOMAP_MAPPED/IOMAP_F_DIRTY per iter. So the type is not consistent.
However we will set IOMAP_DIO_UNWRITTEN in dio->flags, so call
xfs_dio_write_endio() -> xfs_iomap_write_unwritten() for the complete
FSB range.
Do you see a problem with this?
Please see this also for some more background:
https://lore.kernel.org/linux-xfs/20240726171358.GA27612@lst.de/
>
>> + iomap_sector(iomap, pos)) {
>> + ret = -EINVAL;
>> + goto out;
>> + }
>> + } else if (atomic) {
>> + orig_nr_pages = bio_iov_vecs_to_alloc(i, BIO_MAX_VECS);
>> + }
>> +
>> /*
>> * Save the original count and trim the iter to just the extent we
>> * are operating on right now. The iter will be re-expanded once
>> * we are done.
>> */
>> - orig_count = iov_iter_count(dio->submit.iter);
>> - iov_iter_truncate(dio->submit.iter, length);
>> + orig_count = iov_iter_count(i);
>> + iov_iter_truncate(i, length);
>>
>> - if (!iov_iter_count(dio->submit.iter))
>> + if (!iov_iter_count(i))
>> goto out;
>>
>> /*
>> @@ -365,27 +408,46 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> * can set up the page vector appropriately for a ZONE_APPEND
>> * operation.
>> */
>> - bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
>> + bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic);
>> +
>> + if (atomic) {
>> + size_t orig_atomic_size;
>> +
>> + if (!dio->atomic_bio) {
>> + dio->atomic_bio = iomap_dio_alloc_bio_data(iter,
>> + dio, orig_nr_pages, bio_opf, pos);
>> + }
>> + orig_atomic_size = dio->atomic_bio->bi_iter.bi_size;
>> +
>> + /*
>> + * In case of error, caller takes care of freeing the bio. The
>> + * smallest size of atomic write is i_node size, so no need for
>
> What is "i_node size"? Are you referring to i_blocksize?
Yes, I meant i_blocksize()
>
>> + * tail zeroing out.
>> + */
>> + ret = bio_iov_iter_get_pages(dio->atomic_bio, i);
>> + if (!ret) {
>> + copied = dio->atomic_bio->bi_iter.bi_size -
>> + orig_atomic_size;
>> + }
>>
>> - nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
>> + dio->size += copied;
>> + goto out;
>> + }
>> +
>> + nr_pages = bio_iov_vecs_to_alloc(i, BIO_MAX_VECS);
>> do {
>> size_t n;
>> if (dio->error) {
>> - iov_iter_revert(dio->submit.iter, copied);
>> + iov_iter_revert(i, copied);
>> copied = ret = 0;
>> goto out;
>> }
>>
>> - bio = iomap_dio_alloc_bio(iter, dio, nr_pages, bio_opf);
>> + bio = iomap_dio_alloc_bio_data(iter, dio, nr_pages, bio_opf, pos);
>> fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
>> GFP_KERNEL);
>> - bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
>> - bio->bi_write_hint = inode->i_write_hint;
>> - bio->bi_ioprio = dio->iocb->ki_ioprio;
>> - bio->bi_private = dio;
>> - bio->bi_end_io = iomap_dio_bio_end_io;
>
> I see two places (here and iomap_dio_zero) that allocate a bio and
> perform some initialization of it. Can you move the common pieces to
> iomap_dio_alloc_bio instead of adding a iomap_dio_alloc_bio_data
> variant, and move all that to a separate cleanup patch?
Sure
So can it cause harm if we set bio->bi_write_hint and ->bi_ioprio with
the same values as iomap_dio_alloc_bio() for iomap_dio_zero()? If not,
this would help make all the bio alloc code common.
>
>> - ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
>> + ret = bio_iov_iter_get_pages(bio, i);
>> if (unlikely(ret)) {
>> /*
>> * We have to stop part way through an IO. We must fall
>> @@ -408,8 +470,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> dio->size += n;
>> copied += n;
>>
>> - nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter,
>> - BIO_MAX_VECS);
>> + nr_pages = bio_iov_vecs_to_alloc(i, BIO_MAX_VECS);
>> /*
>> * We can only poll for single bio I/Os.
>> */
>> @@ -435,7 +496,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> }
>> out:
>> /* Undo iter limitation to current extent */
>> - iov_iter_reexpand(dio->submit.iter, orig_count - copied);
>> + iov_iter_reexpand(i, orig_count - copied);
>> if (copied)
>> return copied;
>> return ret;
>> @@ -555,6 +616,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>> struct blk_plug plug;
>> struct iomap_dio *dio;
>> loff_t ret = 0;
>> + size_t orig_count = iov_iter_count(iter);
>>
>> trace_iomap_dio_rw_begin(iocb, iter, dio_flags, done_before);
>>
>> @@ -580,6 +642,13 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>> if (iocb->ki_flags & IOCB_NOWAIT)
>> iomi.flags |= IOMAP_NOWAIT;
>>
>> + if (iocb->ki_flags & IOCB_ATOMIC) {
>> + if (bio_iov_vecs_to_alloc(iter, INT_MAX) > BIO_MAX_VECS)
>> + return ERR_PTR(-EINVAL);
>> + iomi.flags |= IOMAP_ATOMIC;
>> + }
>> + dio->atomic_bio = NULL;
>> +
>> if (iov_iter_rw(iter) == READ) {
>> /* reads can always complete inline */
>> dio->flags |= IOMAP_DIO_INLINE_COMP;
>> @@ -665,6 +734,21 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>> iocb->ki_flags &= ~IOCB_HIPRI;
>> }
>>
>> + if (iocb->ki_flags & IOCB_ATOMIC) {
>> + if (ret >= 0) {
>> + if (dio->size == orig_count) {
>> + iomap_dio_submit_bio(&iomi, dio,
>> + dio->atomic_bio, iocb->ki_pos);
>
> Does this need to do task_io_account_write like regular direct writes
> do?
yes, I missed that, will fix
>
>> + } else {
>> + if (dio->atomic_bio)
>> + bio_put(dio->atomic_bio);
>> + ret = -EINVAL;
>> + }
>> + } else if (dio->atomic_bio) {
>> + bio_put(dio->atomic_bio);
>
> This ought to null out dio->atomic_bio to prevent accidental UAF.
ok, fine
Thanks,
John
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH v5 3/7] fs: iomap: Atomic write support
2024-08-22 15:29 ` John Garry
@ 2024-08-22 20:30 ` Darrick J. Wong
2024-08-30 15:48 ` John Garry
0 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2024-08-22 20:30 UTC (permalink / raw)
To: John Garry
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
On Thu, Aug 22, 2024 at 04:29:34PM +0100, John Garry wrote:
>
> > > +
> > > static void iomap_dio_submit_bio(const struct iomap_iter *iter,
> > > struct iomap_dio *dio, struct bio *bio, loff_t pos)
> > > {
> > > @@ -256,7 +275,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> > > * clearing the WRITE_THROUGH flag in the dio request.
> > > */
> > > static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
> > > - const struct iomap *iomap, bool use_fua)
> > > + const struct iomap *iomap, bool use_fua, bool atomic)
> > > {
> > > blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
> > > @@ -268,6 +287,8 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
> > > opflags |= REQ_FUA;
> > > else
> > > dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
> > > + if (atomic)
> > > + opflags |= REQ_ATOMIC;
> > > return opflags;
> > > }
> > > @@ -275,21 +296,23 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
> > > static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> > > struct iomap_dio *dio)
> > > {
> > > + bool atomic = dio->iocb->ki_flags & IOCB_ATOMIC;
> > > const struct iomap *iomap = &iter->iomap;
> > > struct inode *inode = iter->inode;
> > > unsigned int fs_block_size = i_blocksize(inode), pad;
> > > + struct iov_iter *i = dio->submit.iter;
> >
> > If you're going to pull this out into a convenience variable, please do
> > that as a separate patch so that the actual untorn write additions don't
> > get mixed in.
>
> Yeah, I was thinking of doing that, so ok.
>
> >
> > > loff_t length = iomap_length(iter);
> > > loff_t pos = iter->pos;
> > > blk_opf_t bio_opf;
> > > struct bio *bio;
> > > bool need_zeroout = false;
> > > bool use_fua = false;
> > > - int nr_pages, ret = 0;
> > > + int nr_pages, orig_nr_pages, ret = 0;
> > > size_t copied = 0;
> > > size_t orig_count;
> > > if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> > > - !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> > > + !bdev_iter_is_aligned(iomap->bdev, i))
> > > return -EINVAL;
> > > if (iomap->type == IOMAP_UNWRITTEN) {
> > > @@ -322,15 +345,35 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> > > dio->flags &= ~IOMAP_DIO_CALLER_COMP;
> > > }
> > > + if (dio->atomic_bio) {
> > > + /*
> > > + * These should not fail, but check just in case.
> > > + * Caller takes care of freeing the bio.
> > > + */
> > > + if (iter->iomap.bdev != dio->atomic_bio->bi_bdev) {
> > > + ret = -EINVAL;
> > > + goto out;
> > > + }
> > > +
> > > + if (dio->atomic_bio->bi_iter.bi_sector +
> > > + (dio->atomic_bio->bi_iter.bi_size >> SECTOR_SHIFT) !=
> >
> > Hmm, so I guess you stash an untorn write bio in the iomap_dio so that
> > multiple iomap_dio_bio_iter can try to combine a mixed mapping into a
> > single contiguous untorn write that can be completed all at once?
>
> Right, we are writing to a contiguous LBA address range with a single bio
> that happens to cover many different extents.
>
> > I suppose that works as long as the iomap->type is the same across all
> > the _iter calls, but I think that needs explicit checking here.
>
> As an sample, if we try to atomically write over the data in the following
> file:
>
> # xfs_bmap -vvp mnt/file
> mnt/file:
> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> 0: [0..127]: hole 128
> 1: [128..135]: 256..263 0 (256..263) 8 010000
> 2: [136..143]: 264..271 0 (264..271) 8 000000
> 3: [144..255]: 272..383 0 (272..383) 112 010000
> FLAG Values:
> 0100000 Shared extent
> 0010000 Unwritten preallocated extent
> 0001000 Doesn't begin on stripe unit
> 0000100 Doesn't end on stripe unit
> 0000010 Doesn't begin on stripe width
> 0000001 Doesn't end on stripe width
> #
>
>
> Then, the iomap->type/flag is either IOMAP_UNWRITTEN/IOMAP_F_DIRTY or
> IOMAP_MAPPED/IOMAP_F_DIRTY per iter. So the type is not consistent. However
> we will set IOMAP_DIO_UNWRITTEN in dio->flags, so call xfs_dio_write_endio()
> -> xfs_iomap_write_unwritten() for the complete FSB range.
>
> Do you see a problem with this?
>
> Please see this also for some more background:
> https://lore.kernel.org/linux-xfs/20240726171358.GA27612@lst.de/
Yes -- if you have a mix of written and unwritten blocks for the same
chunk of physical space:
0 7
WUWUWUWU
the directio ioend function will start four separate transactions to
convert blocks 1, 3, 5, and 7 to written status. If the system crashes
midway through, they will see this afterwards:
WWWWW0W0
IOWs, although the *disk write* was completed successfully, the mapping
updates were torn, and the user program sees a torn write.
The most performant/painful way to fix this would be to make the whole
ioend completion a logged operation so that we could commit to updating
all the unwritten mappings and restart it after a crash.
The least performant of course is to write zeroes at allocation time,
like we do for fsdax.
A possible middle ground would be to detect IOMAP_ATOMIC in the
->iomap_begin method, notice that there are mixed mappings under the
proposed untorn IO, and pre-convert the unwritten blocks by writing
zeroes to disk and updating the mappings before handing the one single
mapping back to iomap_dio_rw to stage the untorn writes bio. At least
you'd only be suffering that penalty for the (probable) corner case of
someone creating mixed mappings.
>
> >
> > > + iomap_sector(iomap, pos)) {
> > > + ret = -EINVAL;
> > > + goto out;
> > > + }
> > > + } else if (atomic) {
> > > + orig_nr_pages = bio_iov_vecs_to_alloc(i, BIO_MAX_VECS);
> > > + }
> > > +
> > > /*
> > > * Save the original count and trim the iter to just the extent we
> > > * are operating on right now. The iter will be re-expanded once
> > > * we are done.
> > > */
> > > - orig_count = iov_iter_count(dio->submit.iter);
> > > - iov_iter_truncate(dio->submit.iter, length);
> > > + orig_count = iov_iter_count(i);
> > > + iov_iter_truncate(i, length);
> > > - if (!iov_iter_count(dio->submit.iter))
> > > + if (!iov_iter_count(i))
> > > goto out;
> > > /*
> > > @@ -365,27 +408,46 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> > > * can set up the page vector appropriately for a ZONE_APPEND
> > > * operation.
> > > */
> > > - bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
> > > + bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic);
> > > +
> > > + if (atomic) {
> > > + size_t orig_atomic_size;
> > > +
> > > + if (!dio->atomic_bio) {
> > > + dio->atomic_bio = iomap_dio_alloc_bio_data(iter,
> > > + dio, orig_nr_pages, bio_opf, pos);
> > > + }
> > > + orig_atomic_size = dio->atomic_bio->bi_iter.bi_size;
> > > +
> > > + /*
> > > + * In case of error, caller takes care of freeing the bio. The
> > > + * smallest size of atomic write is i_node size, so no need for
> >
> > What is "i_node size"? Are you referring to i_blocksize?
>
> Yes, I meant i_blocksize()
>
> >
> > > + * tail zeroing out.
> > > + */
> > > + ret = bio_iov_iter_get_pages(dio->atomic_bio, i);
> > > + if (!ret) {
> > > + copied = dio->atomic_bio->bi_iter.bi_size -
> > > + orig_atomic_size;
> > > + }
> > > - nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
> > > + dio->size += copied;
> > > + goto out;
> > > + }
> > > +
> > > + nr_pages = bio_iov_vecs_to_alloc(i, BIO_MAX_VECS);
> > > do {
> > > size_t n;
> > > if (dio->error) {
> > > - iov_iter_revert(dio->submit.iter, copied);
> > > + iov_iter_revert(i, copied);
> > > copied = ret = 0;
> > > goto out;
> > > }
> > > - bio = iomap_dio_alloc_bio(iter, dio, nr_pages, bio_opf);
> > > + bio = iomap_dio_alloc_bio_data(iter, dio, nr_pages, bio_opf, pos);
> > > fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> > > GFP_KERNEL);
> > > - bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
> > > - bio->bi_write_hint = inode->i_write_hint;
> > > - bio->bi_ioprio = dio->iocb->ki_ioprio;
> > > - bio->bi_private = dio;
> > > - bio->bi_end_io = iomap_dio_bio_end_io;
> >
> > I see two places (here and iomap_dio_zero) that allocate a bio and
> > perform some initialization of it. Can you move the common pieces to
> > iomap_dio_alloc_bio instead of adding a iomap_dio_alloc_bio_data
> > variant, and move all that to a separate cleanup patch?
>
> Sure
>
> So can it cause harm if we set bio->bi_write_hint and ->bi_ioprio with the
> same values as iomap_dio_alloc_bio() for iomap_dio_zero()? If no, this would
> help make all the bio alloc code common
I'd leave the bi_write_hint and bi_ioprio initialization out of the
common function.
--D
> >
> > > - ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
> > > + ret = bio_iov_iter_get_pages(bio, i);
> > > if (unlikely(ret)) {
> > > /*
> > > * We have to stop part way through an IO. We must fall
> > > @@ -408,8 +470,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> > > dio->size += n;
> > > copied += n;
> > > - nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter,
> > > - BIO_MAX_VECS);
> > > + nr_pages = bio_iov_vecs_to_alloc(i, BIO_MAX_VECS);
> > > /*
> > > * We can only poll for single bio I/Os.
> > > */
> > > @@ -435,7 +496,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> > > }
> > > out:
> > > /* Undo iter limitation to current extent */
> > > - iov_iter_reexpand(dio->submit.iter, orig_count - copied);
> > > + iov_iter_reexpand(i, orig_count - copied);
> > > if (copied)
> > > return copied;
> > > return ret;
> > > @@ -555,6 +616,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > struct blk_plug plug;
> > > struct iomap_dio *dio;
> > > loff_t ret = 0;
> > > + size_t orig_count = iov_iter_count(iter);
> > > trace_iomap_dio_rw_begin(iocb, iter, dio_flags, done_before);
> > > @@ -580,6 +642,13 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > if (iocb->ki_flags & IOCB_NOWAIT)
> > > iomi.flags |= IOMAP_NOWAIT;
> > > + if (iocb->ki_flags & IOCB_ATOMIC) {
> > > + if (bio_iov_vecs_to_alloc(iter, INT_MAX) > BIO_MAX_VECS)
> > > + return ERR_PTR(-EINVAL);
> > > + iomi.flags |= IOMAP_ATOMIC;
> > > + }
> > > + dio->atomic_bio = NULL;
> > > +
> > > if (iov_iter_rw(iter) == READ) {
> > > /* reads can always complete inline */
> > > dio->flags |= IOMAP_DIO_INLINE_COMP;
> > > @@ -665,6 +734,21 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > iocb->ki_flags &= ~IOCB_HIPRI;
> > > }
> > > + if (iocb->ki_flags & IOCB_ATOMIC) {
> > > + if (ret >= 0) {
> > > + if (dio->size == orig_count) {
> > > + iomap_dio_submit_bio(&iomi, dio,
> > > + dio->atomic_bio, iocb->ki_pos);
> >
> > Does this need to do task_io_account_write like regular direct writes
> > do?
>
> yes, I missed that, will fix
>
> >
> > > + } else {
> > > + if (dio->atomic_bio)
> > > + bio_put(dio->atomic_bio);
> > > + ret = -EINVAL;
> > > + }
> > > + } else if (dio->atomic_bio) {
> > > + bio_put(dio->atomic_bio);
> >
> > This ought to null out dio->atomic_bio to prevent accidental UAF.
>
> ok, fine
>
> Thanks,
> John
>
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH v5 3/7] fs: iomap: Atomic write support
2024-08-22 20:30 ` Darrick J. Wong
@ 2024-08-30 15:48 ` John Garry
2024-08-30 23:56 ` Darrick J. Wong
0 siblings, 1 reply; 28+ messages in thread
From: John Garry @ 2024-08-30 15:48 UTC (permalink / raw)
To: Darrick J. Wong
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
On 22/08/2024 21:30, Darrick J. Wong wrote:
>> Then, the iomap->type/flag is either IOMAP_UNWRITTEN/IOMAP_F_DIRTY or
>> IOMAP_MAPPED/IOMAP_F_DIRTY per iter. So the type is not consistent. However
>> we will set IOMAP_DIO_UNWRITTEN in dio->flags, so call xfs_dio_write_endio()
>> -> xfs_iomap_write_unwritten() for the complete FSB range.
>>
>> Do you see a problem with this?
Sorry again for the slow response.
>>
>> Please see this also for some more background:
>> https://urldefense.com/v3/__https://lore.kernel.org/linux-
>> xfs/20240726171358.GA27612@lst.de/__;!!ACWV5N9M2RV99hQ!
>> P5jeP96F8wAtRAblbm8NvRo8nlpil03vA26UMMX8qrYa4IzKecAAk7x1l1M45bBshC3Czxn1CkDXypNSAg$
> Yes -- if you have a mix of written and unwritten blocks for the same
> chunk of physical space:
>
> 0 7
> WUWUWUWU
>
> the directio ioend function will start four separate transactions to
> convert blocks 1, 3, 5, and 7 to written status. If the system crashes
> midway through, they will see this afterwards:
>
> WWWWW0W0
>
> IOWs, although the*disk write* was completed successfully, the mapping
> updates were torn, and the user program sees a torn write.
> > The most performant/painful way to fix this would be to make the whole
> ioend completion a logged operation so that we could commit to updating
> all the unwritten mappings and restart it after a crash.
could we make it a logged operation only for those special cases which
we are interested in?
>
> The least performant of course is to write zeroes at allocation time,
> like we do for fsdax.
That idea was already proposed:
https://lore.kernel.org/linux-xfs/ZcGIPlNCkL6EDx3Z@dread.disaster.area/
>
> A possible middle ground would be to detect IOMAP_ATOMIC in the
> ->iomap_begin method, notice that there are mixed mappings under the
> proposed untorn IO, and pre-convert the unwritten blocks by writing
> zeroes to disk and updating the mappings
Won't that have the same issue as using XFS_BMAPI_ZERO, above, i.e.
zeroing during allocation?
> before handing the one single
> mapping back to iomap_dio_rw to stage the untorn writes bio. At least
> you'd only be suffering that penalty for the (probable) corner case of
> someone creating mixed mappings.
BTW, one issue I have with the sub-extent(or -alloc unit) zeroing from
v4 series is how the unwritten conversion has changed, like:
xfs_iomap_write_unwritten()
{
unsigned int rounding;
/* when converting anything unwritten, we must be spanning an alloc
unit, so round up/down */
if (rounding > 1) {
offset_fsb = rounddown(rounding);
count_fsb = roundup(rounding);
}
...
do {
xfs_bmapi_write();
...
xfs_trans_commit();
} while ();
}
I'm not too happy with it and it seems a bit of a bodge, as I would
rather we report the complete size written (user data and zeroes); then
xfs_iomap_write_unwritten() would do proper individual block conversion.
However, we do something similar for zeroing for sub-FSB writes. I am
not sure if that is the same thing really, as we only round up to FSB
size. Opinion?
Thanks,
John
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH v5 3/7] fs: iomap: Atomic write support
2024-08-30 15:48 ` John Garry
@ 2024-08-30 23:56 ` Darrick J. Wong
2024-09-03 12:43 ` John Garry
0 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2024-08-30 23:56 UTC (permalink / raw)
To: John Garry
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
On Fri, Aug 30, 2024 at 04:48:36PM +0100, John Garry wrote:
> On 22/08/2024 21:30, Darrick J. Wong wrote:
> > > Then, the iomap->type/flag is either IOMAP_UNWRITTEN/IOMAP_F_DIRTY or
> > > IOMAP_MAPPED/IOMAP_F_DIRTY per iter. So the type is not consistent. However
> > > we will set IOMAP_DIO_UNWRITTEN in dio->flags, so call xfs_dio_write_endio()
> > > -> xfs_iomap_write_unwritten() for the complete FSB range.
> > >
> > > Do you see a problem with this?
>
> Sorry again for the slow response.
>
> > >
> > > Please see this also for some more background:
> > > https://urldefense.com/v3/__https://lore.kernel.org/linux-
> > > xfs/20240726171358.GA27612@lst.de/__;!!ACWV5N9M2RV99hQ! P5jeP96F8wAtRAblbm8NvRo8nlpil03vA26UMMX8qrYa4IzKecAAk7x1l1M45bBshC3Czxn1CkDXypNSAg$
> > Yes -- if you have a mix of written and unwritten blocks for the same
> > chunk of physical space:
> >
> > 0 7
> > WUWUWUWU
> >
> > the directio ioend function will start four separate transactions to
> > convert blocks 1, 3, 5, and 7 to written status. If the system crashes
> > midway through, they will see this afterwards:
> >
> > WWWWW0W0
> >
> > IOWs, although the*disk write* was completed successfully, the mapping
> > updates were torn, and the user program sees a torn write.
> > > The most performant/painful way to fix this would be to make the whole
> > ioend completion a logged operation so that we could commit to updating
> > all the unwritten mappings and restart it after a crash.
>
> could we make it logged for those special cases which we are interested in
> only?
Yes, though this is the long route -- you get to define a new ondisk log
item, build all the incore structures to process them, and then define a
new high level operation that uses the state encoded in that new log
item to run all the ioend completion transactions within that framework.
Also you get to add a new log incompat feature bit for this.
Perhaps we should analyze the cost of writing and QA'ing all that vs.
the amount of time saved in the handling of this corner case using one
of the less exciting options.
> > The least performant of course is to write zeroes at allocation time,
> > like we do for fsdax.
>
> That idea was already proposed:
> https://lore.kernel.org/linux-xfs/ZcGIPlNCkL6EDx3Z@dread.disaster.area/
Yes, I'm aware.
> > A possible middle ground would be to detect IOMAP_ATOMIC in the
> > ->iomap_begin method, notice that there are mixed mappings under the
> > proposed untorn IO, and pre-convert the unwritten blocks by writing
> > zeroes to disk and updating the mappings
>
> Won't that have the same issue as using XFS_BMAPI_ZERO, above i.e. zeroing
> during allocation?
Only if you set the forcealign size to > 1fsb and fail to write new
file data in forcealign units, even for non-untorn writes. If all
writes to the file are aligned to the forcealign size then there's only
one extent conversion to be done, and that cannot be torn.
> > before handing the one single
> > mapping back to iomap_dio_rw to stage the untorn writes bio. At least
> > you'd only be suffering that penalty for the (probable) corner case of
> > someone creating mixed mappings.
>
> BTW, one issue I have with the sub-extent(or -alloc unit) zeroing from v4
> series is how the unwritten conversion has changed, like:
>
> xfs_iomap_write_unwritten()
> {
> unsigned int rounding;
>
> /* when converting anything unwritten, we must be spanning an alloc unit,
> so round up/down */
> if (rounding > 1) {
> offset_fsb = rounddown(rounding);
> count_fsb = roundup(rounding);
> }
>
> ...
> do {
> xfs_bmapi_write();
> ...
> xfs_trans_commit();
> } while ();
> }
>
> I'm not too happy with it and it seems a bit of a bodge, as I would rather
> we report the complete size written (user data and zeroes); then
> xfs_iomap_write_unwritten() would do proper individual block conversion.
> However, we do something similar for zeroing for sub-FSB writes. I am not
> sure if that is the same thing really, as we only round up to FSB size.
> Opinion?
xfs_iomap_write_unwritten is in the ioend path; that's not what I was
talking about.
I'm talking about a separate change to the xfs_direct_write_iomap_begin
function that would detect the case where the bmapi_read returns an
@imap that doesn't span the whole forcealign region, then repeatedly
calls bmapi_write(BMAPI_ZERO | BMAPI_CONVERT) on any unwritten mappings
within that file range until the original bmapi_read would return a
single written mapping.
--D
>
> Thanks,
> John
>
>
>
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH v5 3/7] fs: iomap: Atomic write support
2024-08-30 23:56 ` Darrick J. Wong
@ 2024-09-03 12:43 ` John Garry
0 siblings, 0 replies; 28+ messages in thread
From: John Garry @ 2024-09-03 12:43 UTC (permalink / raw)
To: Darrick J. Wong
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
On 31/08/2024 00:56, Darrick J. Wong wrote:
>>> IOWs, although the*disk write* was completed successfully, the mapping
>>> updates were torn, and the user program sees a torn write.
>>>> The most performant/painful way to fix this would be to make the whole
>>> ioend completion a logged operation so that we could commit to updating
>>> all the unwritten mappings and restart it after a crash.
>> could we make it logged for those special cases which we are interested in
>> only?
> Yes, though this is the long route -- you get to define a new ondisk log
> item, build all the incore structures to process them, and then define a
> new high level operation that uses the state encoded in that new log
> item to run all the ioend completion transactions within that framework.
> Also you get to add a new log incompat feature bit for this.
>
> Perhaps we should analyze the cost of writing and QA'ing all that vs.
> the amount of time saved in the handling of this corner case using one
> of the less exciting options.
From the sound of all the changes required, I am not too keen on that
option...
>
>>> The least performant of course is to write zeroes at allocation time,
>>> like we do for fsdax.
>> That idea was already proposed:
>> https://urldefense.com/v3/__https://lore.kernel.org/linux-xfs/
>> ZcGIPlNCkL6EDx3Z@dread.disaster.area/__;!!ACWV5N9M2RV99hQ!
>> Kmx2Rrrot3GTqBS3kwhTi1nxIrpiPDyiy3TEfowsRKonvY90W7o4xUv9r9seOfDAMa2gT-
>> TCNVlpH-CGXA$
> Yes, I'm aware.
>
>>> A possible middle ground would be to detect IOMAP_ATOMIC in the
>>> ->iomap_begin method, notice that there are mixed mappings under the
>>> proposed untorn IO, and pre-convert the unwritten blocks by writing
>>> zeroes to disk and updating the mappings
>> Won't that have the same issue as using XFS_BMAPI_ZERO, above i.e. zeroing
>> during allocation?
> Only if you set the forcealign size to > 1fsb and fail to write new
> file data in forcealign units, even for non-untorn writes. If all
> writes to the file are aligned to the forcealign size then there's only
> one extent conversion to be done, and that cannot be torn.
> >>> before handing the one single
>>> mapping back to iomap_dio_rw to stage the untorn writes bio. At least
>>> you'd only be suffering that penalty for the (probable) corner case of
>>> someone creating mixed mappings.
>> BTW, one issue I have with the sub-extent(or -alloc unit) zeroing from v4
>> series is how the unwritten conversion has changed, like:
>>
>> xfs_iomap_write_unwritten()
>> {
>> unsigned int rounding;
>>
>> /* when converting anything unwritten, we must be spanning an alloc unit,
>> so round up/down */
>> if (rounding > 1) {
>> offset_fsb = rounddown(rounding);
>> count_fsb = roundup(rounding);
>> }
>>
>> ...
>> do {
>> xfs_bmapi_write();
>> ...
>> xfs_trans_commit();
>> } while ();
>> }
>>
>> I'm not too happy with it and it seems a bit of a bodge, as I would rather
>> we report the complete size written (user data and zeroes); then
>> xfs_iomap_write_unwritten() would do proper individual block conversion.
>> However, we do something similar for zeroing for sub-FSB writes. I am not
>> sure if that is the same thing really, as we only round up to FSB size.
>> Opinion?
> xfs_iomap_write_unwritten is in the ioend path; that's not what I was
> talking about.
Sure, it's not the same as what you are talking about, but I am just
mentioning it as it was included in my sub-FS extent zeroing solution
and I am not too happy about it. It's just a concern there.
>
> I'm talking about a separate change to the xfs_direct_write_iomap_begin
> function that would detect the case where the bmapi_read returns an
> @imap that doesn't span the whole forcealign region, then repeatedly
> calls bmapi_write(BMAPI_ZERO | BMAPI_CONVERT) on any unwritten mappings
> within that file range until the original bmapi_read would return a
> single written mapping.
Right, I get the idea. I'll check it further.
Cheers,
John
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v5 4/7] xfs: Support FS_XFLAG_ATOMICWRITES for forcealign
2024-08-17 9:47 [PATCH v5 0/7] block atomic writes for xfs John Garry
` (2 preceding siblings ...)
2024-08-17 9:47 ` [PATCH v5 3/7] fs: iomap: Atomic write support John Garry
@ 2024-08-17 9:47 ` John Garry
2024-08-21 17:07 ` Darrick J. Wong
2024-08-17 9:47 ` [PATCH v5 5/7] xfs: Support atomic write for statx John Garry
` (2 subsequent siblings)
6 siblings, 1 reply; 28+ messages in thread
From: John Garry @ 2024-08-17 9:47 UTC (permalink / raw)
To: axboe, brauner, djwong, viro, jack, chandan.babu, dchinner, hch
Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch, John Garry
Add initial support for new flag FS_XFLAG_ATOMICWRITES for forcealign
enabled.
This flag is a file attribute that mirrors an ondisk inode flag. Actual
support for untorn file writes (for now) depends on both the iflag and the
underlying storage devices, which we can only really check at statx and
pwritev2() time. This is the same story as FS_XFLAG_DAX, which signals to
the fs that we should try to enable the fsdax IO path on the file (instead
of the regular page cache), but applications have to query STATX_ATTR_DAX to
find out if they really got that IO path.
Current kernel support for atomic writes is based on HW support (for atomic
writes). As such, it is required to ensure extent alignment with
atomic_write_unit_max so that an atomic write can result in a single
HW-compliant IO operation.
rtvol also guarantees extent alignment, but we are basing support initially
on forcealign, which is not supported for rtvol yet.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/libxfs/xfs_format.h | 11 +++++--
fs/xfs/libxfs/xfs_inode_buf.c | 52 ++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_inode_util.c | 4 +++
fs/xfs/libxfs/xfs_sb.c | 2 ++
fs/xfs/xfs_buf.c | 15 +++++++++-
fs/xfs/xfs_buf.h | 4 ++-
fs/xfs/xfs_buf_mem.c | 2 +-
fs/xfs/xfs_inode.h | 5 ++++
fs/xfs/xfs_ioctl.c | 52 ++++++++++++++++++++++++++++++++++
fs/xfs/xfs_mount.h | 2 ++
fs/xfs/xfs_super.c | 12 ++++++++
include/uapi/linux/fs.h | 1 +
12 files changed, 157 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 04c6cbc943c2..a9f3389438a6 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -353,12 +353,16 @@ xfs_sb_has_compat_feature(
#define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
#define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
#define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30) /* aligned file data extents */
+#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 31) /* atomicwrites enabled */
+
#define XFS_SB_FEAT_RO_COMPAT_ALL \
(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
XFS_SB_FEAT_RO_COMPAT_REFLINK| \
XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
- XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
+ XFS_SB_FEAT_RO_COMPAT_FORCEALIGN | \
+ XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
+
#define XFS_SB_FEAT_RO_COMPAT_UNKNOWN ~XFS_SB_FEAT_RO_COMPAT_ALL
static inline bool
xfs_sb_has_ro_compat_feature(
@@ -1097,6 +1101,7 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
#define XFS_DIFLAG2_NREXT64_BIT 4 /* large extent counters */
/* data extent mappings for regular files must be aligned to extent size hint */
#define XFS_DIFLAG2_FORCEALIGN_BIT 5
+#define XFS_DIFLAG2_ATOMICWRITES_BIT 6
#define XFS_DIFLAG2_DAX (1 << XFS_DIFLAG2_DAX_BIT)
#define XFS_DIFLAG2_REFLINK (1 << XFS_DIFLAG2_REFLINK_BIT)
@@ -1104,10 +1109,12 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
#define XFS_DIFLAG2_BIGTIME (1 << XFS_DIFLAG2_BIGTIME_BIT)
#define XFS_DIFLAG2_NREXT64 (1 << XFS_DIFLAG2_NREXT64_BIT)
#define XFS_DIFLAG2_FORCEALIGN (1 << XFS_DIFLAG2_FORCEALIGN_BIT)
+#define XFS_DIFLAG2_ATOMICWRITES (1 << XFS_DIFLAG2_ATOMICWRITES_BIT)
#define XFS_DIFLAG2_ANY \
(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
- XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN)
+ XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN | \
+ XFS_DIFLAG2_ATOMICWRITES)
static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
{
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 1c59891fa9e2..59933c7df56d 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -178,7 +178,10 @@ xfs_inode_from_disk(
struct xfs_inode *ip,
struct xfs_dinode *from)
{
+ struct xfs_buftarg *target = xfs_inode_buftarg(ip);
struct inode *inode = VFS_I(ip);
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_sb *sbp = &mp->m_sb;
int error;
xfs_failaddr_t fa;
@@ -261,6 +264,13 @@ xfs_inode_from_disk(
}
if (xfs_is_reflink_inode(ip))
xfs_ifork_init_cow(ip);
+
+ if (xfs_inode_has_atomicwrites(ip)) {
+ if (sbp->sb_blocksize < target->bt_bdev_awu_min ||
+ sbp->sb_blocksize * ip->i_extsize > target->bt_bdev_awu_max)
+ ip->i_diflags2 &= ~XFS_DIFLAG2_ATOMICWRITES;
+ }
+
return 0;
out_destroy_data_fork:
@@ -483,6 +493,40 @@ xfs_dinode_verify_nrext64(
return NULL;
}
+static xfs_failaddr_t
+xfs_inode_validate_atomicwrites(
+ struct xfs_mount *mp,
+ uint32_t extsize,
+ uint64_t flags2)
+{
+ /* superblock rocompat feature flag */
+ if (!xfs_has_atomicwrites(mp))
+ return __this_address;
+
+ /*
+ * forcealign is required, so rely on sanity checks in
+ * xfs_inode_validate_forcealign()
+ */
+ if (!(flags2 & XFS_DIFLAG2_FORCEALIGN))
+ return __this_address;
+
+ if (!is_power_of_2(extsize))
+ return __this_address;
+
+ /* Required to guarantee data block alignment */
+ if (mp->m_sb.sb_agblocks % extsize)
+ return __this_address;
+
+ /* Requires stripe unit+width be a multiple of extsize */
+ if (mp->m_dalign && (mp->m_dalign % extsize))
+ return __this_address;
+
+ if (mp->m_swidth && (mp->m_swidth % extsize))
+ return __this_address;
+
+ return NULL;
+}
+
xfs_failaddr_t
xfs_dinode_verify(
struct xfs_mount *mp,
@@ -666,6 +710,14 @@ xfs_dinode_verify(
return fa;
}
+ if (flags2 & XFS_DIFLAG2_ATOMICWRITES) {
+ fa = xfs_inode_validate_atomicwrites(mp,
+ be32_to_cpu(dip->di_extsize),
+ flags2);
+ if (fa)
+ return fa;
+ }
+
return NULL;
}
diff --git a/fs/xfs/libxfs/xfs_inode_util.c b/fs/xfs/libxfs/xfs_inode_util.c
index b264939d8855..dbd5b16e1844 100644
--- a/fs/xfs/libxfs/xfs_inode_util.c
+++ b/fs/xfs/libxfs/xfs_inode_util.c
@@ -82,6 +82,8 @@ xfs_flags2diflags2(
di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
if (xflags & FS_XFLAG_FORCEALIGN)
di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
+ if (xflags & FS_XFLAG_ATOMICWRITES)
+ di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;
return di_flags2;
}
@@ -130,6 +132,8 @@ xfs_ip2xflags(
flags |= FS_XFLAG_COWEXTSIZE;
if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
flags |= FS_XFLAG_FORCEALIGN;
+ if (ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
+ flags |= FS_XFLAG_ATOMICWRITES;
}
if (xfs_inode_has_attr_fork(ip))
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index e56911553edd..5de8725bf93a 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -166,6 +166,8 @@ xfs_sb_version_to_features(
features |= XFS_FEAT_INOBTCNT;
if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
features |= XFS_FEAT_FORCEALIGN;
+ if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
+ features |= XFS_FEAT_ATOMICWRITES;
if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
features |= XFS_FEAT_FTYPE;
if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index aa4dbda7b536..44bee3e2b2bb 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2060,6 +2060,8 @@ int
xfs_init_buftarg(
struct xfs_buftarg *btp,
size_t logical_sectorsize,
+ unsigned int awu_min,
+ unsigned int awu_max,
const char *descr)
{
/* Set up device logical sector size mask */
@@ -2086,6 +2088,9 @@ xfs_init_buftarg(
btp->bt_shrinker->scan_objects = xfs_buftarg_shrink_scan;
btp->bt_shrinker->private_data = btp;
shrinker_register(btp->bt_shrinker);
+
+ btp->bt_bdev_awu_min = awu_min;
+ btp->bt_bdev_awu_max = awu_max;
return 0;
out_destroy_io_count:
@@ -2102,6 +2107,7 @@ xfs_alloc_buftarg(
{
struct xfs_buftarg *btp;
const struct dax_holder_operations *ops = NULL;
+ unsigned int awu_min = 0, awu_max = 0;
#if defined(CONFIG_FS_DAX) && defined(CONFIG_MEMORY_FAILURE)
ops = &xfs_dax_holder_operations;
@@ -2115,6 +2121,13 @@ xfs_alloc_buftarg(
btp->bt_daxdev = fs_dax_get_by_bdev(btp->bt_bdev, &btp->bt_dax_part_off,
mp, ops);
+ if (bdev_can_atomic_write(btp->bt_bdev)) {
+ struct request_queue *q = bdev_get_queue(btp->bt_bdev);
+
+ awu_min = queue_atomic_write_unit_min_bytes(q);
+ awu_max = queue_atomic_write_unit_max_bytes(q);
+ }
+
/*
* When allocating the buftargs we have not yet read the super block and
* thus don't know the file system sector size yet.
@@ -2122,7 +2135,7 @@ xfs_alloc_buftarg(
if (xfs_setsize_buftarg(btp, bdev_logical_block_size(btp->bt_bdev)))
goto error_free;
if (xfs_init_buftarg(btp, bdev_logical_block_size(btp->bt_bdev),
- mp->m_super->s_id))
+ awu_min, awu_max, mp->m_super->s_id))
goto error_free;
return btp;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index b1580644501f..3bcd8137d739 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -124,6 +124,8 @@ struct xfs_buftarg {
struct percpu_counter bt_io_count;
struct ratelimit_state bt_ioerror_rl;
+ unsigned int bt_bdev_awu_min, bt_bdev_awu_max;
+
/* built-in cache, if we're not using the perag one */
struct xfs_buf_cache bt_cache[];
};
@@ -393,7 +395,7 @@ bool xfs_verify_magic16(struct xfs_buf *bp, __be16 dmagic);
/* for xfs_buf_mem.c only: */
int xfs_init_buftarg(struct xfs_buftarg *btp, size_t logical_sectorsize,
- const char *descr);
+ unsigned int awu_min, unsigned int awu_max, const char *descr);
void xfs_destroy_buftarg(struct xfs_buftarg *btp);
#endif /* __XFS_BUF_H__ */
diff --git a/fs/xfs/xfs_buf_mem.c b/fs/xfs/xfs_buf_mem.c
index 9bb2d24de709..af48a8da2f0f 100644
--- a/fs/xfs/xfs_buf_mem.c
+++ b/fs/xfs/xfs_buf_mem.c
@@ -93,7 +93,7 @@ xmbuf_alloc(
btp->bt_meta_sectorsize = XMBUF_BLOCKSIZE;
btp->bt_meta_sectormask = XMBUF_BLOCKSIZE - 1;
- error = xfs_init_buftarg(btp, XMBUF_BLOCKSIZE, descr);
+ error = xfs_init_buftarg(btp, XMBUF_BLOCKSIZE, 0, 0, descr);
if (error)
goto out_bcache;
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 336124105c47..cfcb67da12cb 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -321,6 +321,11 @@ static inline bool xfs_inode_has_forcealign(struct xfs_inode *ip)
return ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN;
}
+static inline bool xfs_inode_has_atomicwrites(struct xfs_inode *ip)
+{
+ return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
+}
+
/*
* Decide if this file is a realtime file whose data allocation unit is larger
* than a single filesystem block.
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index a68ec68e5b92..7af2837779e8 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -502,6 +502,49 @@ xfs_ioctl_setattr_forcealign(
return 0;
}
+
+/*
+ * Forcealign requires a power-of-2 extent size hint.
+ */
+static int
+xfs_ioctl_setattr_atomicwrites(
+ struct xfs_inode *ip,
+ struct fileattr *fa)
+{
+ struct xfs_buftarg *target = xfs_inode_buftarg(ip);
+ struct xfs_mount *mp = ip->i_mount;
+ uint32_t extsize = XFS_B_TO_FSB(mp, fa->fsx_extsize);
+ struct xfs_sb *sbp = &mp->m_sb;
+
+ if (!xfs_has_atomicwrites(mp))
+ return -EINVAL;
+
+ if (!(fa->fsx_xflags & FS_XFLAG_FORCEALIGN))
+ return -EINVAL;
+
+ if (!is_power_of_2(extsize))
+ return -EINVAL;
+
+ /* Required to guarantee data block alignment */
+ if (mp->m_sb.sb_agblocks % extsize)
+ return -EINVAL;
+
+ /* Requires stripe unit+width be a multiple of extsize */
+ if (mp->m_dalign && (mp->m_dalign % extsize))
+ return -EINVAL;
+
+ if (mp->m_swidth && (mp->m_swidth % extsize))
+ return -EINVAL;
+
+ if (target->bt_bdev_awu_min > sbp->sb_blocksize)
+ return -EINVAL;
+
+ if (target->bt_bdev_awu_max < fa->fsx_extsize)
+ return -EINVAL;
+
+ return 0;
+}
+
static int
xfs_ioctl_setattr_xflags(
struct xfs_trans *tp,
@@ -511,9 +554,12 @@ xfs_ioctl_setattr_xflags(
struct xfs_mount *mp = ip->i_mount;
bool rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
bool forcealign = fa->fsx_xflags & FS_XFLAG_FORCEALIGN;
+ bool atomic_writes;
uint64_t i_flags2;
int error;
+ atomic_writes = fa->fsx_xflags & FS_XFLAG_ATOMICWRITES;
+
/* Can't change RT or forcealign flags if any extents are allocated. */
if (rtflag != XFS_IS_REALTIME_INODE(ip) ||
forcealign != xfs_inode_has_forcealign(ip)) {
@@ -554,6 +600,12 @@ xfs_ioctl_setattr_xflags(
return error;
}
+ if (atomic_writes) {
+ error = xfs_ioctl_setattr_atomicwrites(ip, fa);
+ if (error)
+ return error;
+ }
+
ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
ip->i_diflags2 = i_flags2;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 30228fea908d..0c5a3ae3cdaf 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -300,6 +300,7 @@ typedef struct xfs_mount {
#define XFS_FEAT_NREXT64 (1ULL << 26) /* large extent counters */
#define XFS_FEAT_EXCHANGE_RANGE (1ULL << 27) /* exchange range */
#define XFS_FEAT_FORCEALIGN (1ULL << 28) /* aligned file data extents */
+#define XFS_FEAT_ATOMICWRITES (1ULL << 29) /* atomic writes support */
/* Mount features */
#define XFS_FEAT_NOATTR2 (1ULL << 48) /* disable attr2 creation */
@@ -387,6 +388,7 @@ __XFS_HAS_V4_FEAT(v3inodes, V3INODES)
__XFS_HAS_V4_FEAT(crc, CRC)
__XFS_HAS_V4_FEAT(pquotino, PQUOTINO)
__XFS_HAS_FEAT(forcealign, FORCEALIGN)
+__XFS_HAS_FEAT(atomicwrites, ATOMICWRITES)
/*
* Mount features
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index b52a01b50387..5352b90b2bb6 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1721,6 +1721,18 @@ xfs_fs_fill_super(
mp->m_features &= ~XFS_FEAT_DISCARD;
}
+ if (xfs_has_atomicwrites(mp)) {
+ if (!xfs_has_forcealign(mp)) {
+ xfs_alert(mp,
+ "forcealign required for atomicwrites!");
+ error = -EINVAL;
+ goto out_filestream_unmount;
+ }
+
+ xfs_warn(mp,
+"EXPERIMENTAL atomicwrites feature in use. Use at your own risk!");
+ }
+
if (xfs_has_reflink(mp)) {
if (mp->m_sb.sb_rblocks) {
xfs_alert(mp,
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index f55d650f904a..c416f549e94d 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -160,6 +160,7 @@ struct fsxattr {
#define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size allocator hint */
/* data extent mappings for regular files must be aligned to extent size hint */
#define FS_XFLAG_FORCEALIGN 0x00020000
+#define FS_XFLAG_ATOMICWRITES 0x00040000 /* atomic writes enabled */
#define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */
/* the read-only stuff doesn't really belong here, but any other place is
--
2.31.1
^ permalink raw reply related [flat|nested] 28+ messages in thread* Re: [PATCH v5 4/7] xfs: Support FS_XFLAG_ATOMICWRITES for forcealign
2024-08-17 9:47 ` [PATCH v5 4/7] xfs: Support FS_XFLAG_ATOMICWRITES for forcealign John Garry
@ 2024-08-21 17:07 ` Darrick J. Wong
2024-08-22 17:45 ` John Garry
0 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2024-08-21 17:07 UTC (permalink / raw)
To: John Garry
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
On Sat, Aug 17, 2024 at 09:47:57AM +0000, John Garry wrote:
> Add initial support for new flag FS_XFLAG_ATOMICWRITES for forcealign
> enabled.
>
> This flag is a file attribute that mirrors an ondisk inode flag. Actual
> support for untorn file writes (for now) depends on both the iflag and the
> underlying storage devices, which we can only really check at statx and
> pwritev2() time. This is the same story as FS_XFLAG_DAX, which signals to
> the fs that we should try to enable the fsdax IO path on the file (instead
> of the regular page cache), but applications have to query STATX_ATTR_DAX to
> find out if they really got that IO path.
>
> Current kernel support for atomic writes is based on HW support (for atomic
> writes). As such, it is required to ensure extent alignment with
> atomic_write_unit_max so that an atomic write can result in a single
> HW-compliant IO operation.
>
> rtvol also guarantees extent alignment, but we are basing support initially
> on forcealign, which is not supported for rtvol yet.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> fs/xfs/libxfs/xfs_format.h | 11 +++++--
> fs/xfs/libxfs/xfs_inode_buf.c | 52 ++++++++++++++++++++++++++++++++++
> fs/xfs/libxfs/xfs_inode_util.c | 4 +++
> fs/xfs/libxfs/xfs_sb.c | 2 ++
> fs/xfs/xfs_buf.c | 15 +++++++++-
> fs/xfs/xfs_buf.h | 4 ++-
> fs/xfs/xfs_buf_mem.c | 2 +-
> fs/xfs/xfs_inode.h | 5 ++++
> fs/xfs/xfs_ioctl.c | 52 ++++++++++++++++++++++++++++++++++
> fs/xfs/xfs_mount.h | 2 ++
> fs/xfs/xfs_super.c | 12 ++++++++
> include/uapi/linux/fs.h | 1 +
> 12 files changed, 157 insertions(+), 5 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 04c6cbc943c2..a9f3389438a6 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -353,12 +353,16 @@ xfs_sb_has_compat_feature(
> #define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
> #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
> #define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30) /* aligned file data extents */
> +#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 31) /* atomicwrites enabled */
Do you ever see test failures in xfs/270?
> +
> #define XFS_SB_FEAT_RO_COMPAT_ALL \
> (XFS_SB_FEAT_RO_COMPAT_FINOBT | \
> XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
> XFS_SB_FEAT_RO_COMPAT_REFLINK| \
> XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
> - XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
> + XFS_SB_FEAT_RO_COMPAT_FORCEALIGN | \
> + XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
> +
> #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN ~XFS_SB_FEAT_RO_COMPAT_ALL
> static inline bool
> xfs_sb_has_ro_compat_feature(
> @@ -1097,6 +1101,7 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
> #define XFS_DIFLAG2_NREXT64_BIT 4 /* large extent counters */
> /* data extent mappings for regular files must be aligned to extent size hint */
> #define XFS_DIFLAG2_FORCEALIGN_BIT 5
> +#define XFS_DIFLAG2_ATOMICWRITES_BIT 6
>
> #define XFS_DIFLAG2_DAX (1 << XFS_DIFLAG2_DAX_BIT)
> #define XFS_DIFLAG2_REFLINK (1 << XFS_DIFLAG2_REFLINK_BIT)
> @@ -1104,10 +1109,12 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
> #define XFS_DIFLAG2_BIGTIME (1 << XFS_DIFLAG2_BIGTIME_BIT)
> #define XFS_DIFLAG2_NREXT64 (1 << XFS_DIFLAG2_NREXT64_BIT)
> #define XFS_DIFLAG2_FORCEALIGN (1 << XFS_DIFLAG2_FORCEALIGN_BIT)
> +#define XFS_DIFLAG2_ATOMICWRITES (1 << XFS_DIFLAG2_ATOMICWRITES_BIT)
>
> #define XFS_DIFLAG2_ANY \
> (XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
> - XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN)
> + XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN | \
> + XFS_DIFLAG2_ATOMICWRITES)
>
> static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
> {
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 1c59891fa9e2..59933c7df56d 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -178,7 +178,10 @@ xfs_inode_from_disk(
> struct xfs_inode *ip,
> struct xfs_dinode *from)
> {
> + struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> struct inode *inode = VFS_I(ip);
> + struct xfs_mount *mp = ip->i_mount;
> + struct xfs_sb *sbp = &mp->m_sb;
> int error;
> xfs_failaddr_t fa;
>
> @@ -261,6 +264,13 @@ xfs_inode_from_disk(
> }
> if (xfs_is_reflink_inode(ip))
> xfs_ifork_init_cow(ip);
> +
> + if (xfs_inode_has_atomicwrites(ip)) {
> + if (sbp->sb_blocksize < target->bt_bdev_awu_min ||
> + sbp->sb_blocksize * ip->i_extsize > target->bt_bdev_awu_max)
Can this multiplication trigger integer overflows?
> + ip->i_diflags2 &= ~XFS_DIFLAG2_ATOMICWRITES;
Ondisk iflag updates must use transactions.
Or you can fail IOCB_ATOMIC writes if XFS_DIFLAG2_ATOMICWRITES is set
but the forcealign blocksize doesn't fit with awu_min/max.
> + }
> +
> return 0;
>
> out_destroy_data_fork:
> @@ -483,6 +493,40 @@ xfs_dinode_verify_nrext64(
> return NULL;
> }
>
> +static xfs_failaddr_t
> +xfs_inode_validate_atomicwrites(
> + struct xfs_mount *mp,
> + uint32_t extsize,
> + uint64_t flags2)
> +{
> + /* superblock rocompat feature flag */
> + if (!xfs_has_atomicwrites(mp))
> + return __this_address;
> +
> + /*
> + * forcealign is required, so rely on sanity checks in
> + * xfs_inode_validate_forcealign()
> + */
> + if (!(flags2 & XFS_DIFLAG2_FORCEALIGN))
> + return __this_address;
> +
> + if (!is_power_of_2(extsize))
> + return __this_address;
> +
> + /* Required to guarantee data block alignment */
> + if (mp->m_sb.sb_agblocks % extsize)
> + return __this_address;
> +
> + /* Requires stripe unit+width be a multiple of extsize */
> + if (mp->m_dalign && (mp->m_dalign % extsize))
> + return __this_address;
> +
> + if (mp->m_swidth && (mp->m_swidth % extsize))
IIRC m_dalign and m_swidth can be set at mount time, which can result in
inode verifiers logging corruption errors if those parameters change. I
think we should validate these two congruencies when setting
FMODE_CAN_ATOMIC_WRITE.
> + return __this_address;
> +
> + return NULL;
> +}
> +
> xfs_failaddr_t
> xfs_dinode_verify(
> struct xfs_mount *mp,
> @@ -666,6 +710,14 @@ xfs_dinode_verify(
> return fa;
> }
>
> + if (flags2 & XFS_DIFLAG2_ATOMICWRITES) {
> + fa = xfs_inode_validate_atomicwrites(mp,
> + be32_to_cpu(dip->di_extsize),
> + flags2);
> + if (fa)
> + return fa;
> + }
> +
> return NULL;
> }
>
> diff --git a/fs/xfs/libxfs/xfs_inode_util.c b/fs/xfs/libxfs/xfs_inode_util.c
> index b264939d8855..dbd5b16e1844 100644
> --- a/fs/xfs/libxfs/xfs_inode_util.c
> +++ b/fs/xfs/libxfs/xfs_inode_util.c
> @@ -82,6 +82,8 @@ xfs_flags2diflags2(
> di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
> if (xflags & FS_XFLAG_FORCEALIGN)
> di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
> + if (xflags & FS_XFLAG_ATOMICWRITES)
> + di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;
>
> return di_flags2;
> }
> @@ -130,6 +132,8 @@ xfs_ip2xflags(
> flags |= FS_XFLAG_COWEXTSIZE;
> if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
> flags |= FS_XFLAG_FORCEALIGN;
> + if (ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
> + flags |= FS_XFLAG_ATOMICWRITES;
> }
>
> if (xfs_inode_has_attr_fork(ip))
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index e56911553edd..5de8725bf93a 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -166,6 +166,8 @@ xfs_sb_version_to_features(
> features |= XFS_FEAT_INOBTCNT;
> if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
> features |= XFS_FEAT_FORCEALIGN;
> + if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
> + features |= XFS_FEAT_ATOMICWRITES;
> if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
> features |= XFS_FEAT_FTYPE;
> if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index aa4dbda7b536..44bee3e2b2bb 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -2060,6 +2060,8 @@ int
> xfs_init_buftarg(
> struct xfs_buftarg *btp,
> size_t logical_sectorsize,
> + unsigned int awu_min,
> + unsigned int awu_max,
> const char *descr)
> {
> /* Set up device logical sector size mask */
> @@ -2086,6 +2088,9 @@ xfs_init_buftarg(
> btp->bt_shrinker->scan_objects = xfs_buftarg_shrink_scan;
> btp->bt_shrinker->private_data = btp;
> shrinker_register(btp->bt_shrinker);
> +
> + btp->bt_bdev_awu_min = awu_min;
> + btp->bt_bdev_awu_max = awu_max;
> return 0;
>
> out_destroy_io_count:
> @@ -2102,6 +2107,7 @@ xfs_alloc_buftarg(
> {
> struct xfs_buftarg *btp;
> const struct dax_holder_operations *ops = NULL;
> + unsigned int awu_min = 0, awu_max = 0;
>
> #if defined(CONFIG_FS_DAX) && defined(CONFIG_MEMORY_FAILURE)
> ops = &xfs_dax_holder_operations;
> @@ -2115,6 +2121,13 @@ xfs_alloc_buftarg(
> btp->bt_daxdev = fs_dax_get_by_bdev(btp->bt_bdev, &btp->bt_dax_part_off,
> mp, ops);
>
> + if (bdev_can_atomic_write(btp->bt_bdev)) {
> + struct request_queue *q = bdev_get_queue(btp->bt_bdev);
> +
> + awu_min = queue_atomic_write_unit_min_bytes(q);
> + awu_max = queue_atomic_write_unit_max_bytes(q);
> + }
> +
> /*
> * When allocating the buftargs we have not yet read the super block and
> * thus don't know the file system sector size yet.
> @@ -2122,7 +2135,7 @@ xfs_alloc_buftarg(
> if (xfs_setsize_buftarg(btp, bdev_logical_block_size(btp->bt_bdev)))
> goto error_free;
> if (xfs_init_buftarg(btp, bdev_logical_block_size(btp->bt_bdev),
> - mp->m_super->s_id))
> + awu_min, awu_max, mp->m_super->s_id))
> goto error_free;
>
> return btp;
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index b1580644501f..3bcd8137d739 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -124,6 +124,8 @@ struct xfs_buftarg {
> struct percpu_counter bt_io_count;
> struct ratelimit_state bt_ioerror_rl;
>
> + unsigned int bt_bdev_awu_min, bt_bdev_awu_max;
Please add a comment here about what these mean. Not everyone is going
to know what "awu" abbreviates.
> +
> /* built-in cache, if we're not using the perag one */
> struct xfs_buf_cache bt_cache[];
> };
> @@ -393,7 +395,7 @@ bool xfs_verify_magic16(struct xfs_buf *bp, __be16 dmagic);
>
> /* for xfs_buf_mem.c only: */
> int xfs_init_buftarg(struct xfs_buftarg *btp, size_t logical_sectorsize,
> - const char *descr);
> + unsigned int awu_min, unsigned int awu_max, const char *descr);
> void xfs_destroy_buftarg(struct xfs_buftarg *btp);
>
> #endif /* __XFS_BUF_H__ */
> diff --git a/fs/xfs/xfs_buf_mem.c b/fs/xfs/xfs_buf_mem.c
> index 9bb2d24de709..af48a8da2f0f 100644
> --- a/fs/xfs/xfs_buf_mem.c
> +++ b/fs/xfs/xfs_buf_mem.c
> @@ -93,7 +93,7 @@ xmbuf_alloc(
> btp->bt_meta_sectorsize = XMBUF_BLOCKSIZE;
> btp->bt_meta_sectormask = XMBUF_BLOCKSIZE - 1;
>
> - error = xfs_init_buftarg(btp, XMBUF_BLOCKSIZE, descr);
> + error = xfs_init_buftarg(btp, XMBUF_BLOCKSIZE, 0, 0, descr);
> if (error)
> goto out_bcache;
>
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 336124105c47..cfcb67da12cb 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -321,6 +321,11 @@ static inline bool xfs_inode_has_forcealign(struct xfs_inode *ip)
> return ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN;
> }
>
> +static inline bool xfs_inode_has_atomicwrites(struct xfs_inode *ip)
> +{
> + return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
> +}
> +
> /*
> * Decide if this file is a realtime file whose data allocation unit is larger
> * than a single filesystem block.
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index a68ec68e5b92..7af2837779e8 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -502,6 +502,49 @@ xfs_ioctl_setattr_forcealign(
> return 0;
> }
>
> +
> +/*
> + * Forcealign requires a power-of-2 extent size hint.
> + */
> +static int
> +xfs_ioctl_setattr_atomicwrites(
> + struct xfs_inode *ip,
> + struct fileattr *fa)
> +{
> + struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> + struct xfs_mount *mp = ip->i_mount;
> + uint32_t extsize = XFS_B_TO_FSB(mp, fa->fsx_extsize);
> + struct xfs_sb *sbp = &mp->m_sb;
> +
> + if (!xfs_has_atomicwrites(mp))
> + return -EINVAL;
> +
> + if (!(fa->fsx_xflags & FS_XFLAG_FORCEALIGN))
> + return -EINVAL;
> +
> + if (!is_power_of_2(extsize))
> + return -EINVAL;
> +
> + /* Required to guarantee data block alignment */
> + if (mp->m_sb.sb_agblocks % extsize)
> + return -EINVAL;
> +
> + /* Requires stripe unit+width be a multiple of extsize */
> + if (mp->m_dalign && (mp->m_dalign % extsize))
> + return -EINVAL;
> +
> + if (mp->m_swidth && (mp->m_swidth % extsize))
> + return -EINVAL;
> +
> + if (target->bt_bdev_awu_min > sbp->sb_blocksize)
> + return -EINVAL;
> +
> + if (target->bt_bdev_awu_max < fa->fsx_extsize)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> static int
> xfs_ioctl_setattr_xflags(
> struct xfs_trans *tp,
> @@ -511,9 +554,12 @@ xfs_ioctl_setattr_xflags(
> struct xfs_mount *mp = ip->i_mount;
> bool rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
> bool forcealign = fa->fsx_xflags & FS_XFLAG_FORCEALIGN;
> + bool atomic_writes;
> uint64_t i_flags2;
> int error;
>
> + atomic_writes = fa->fsx_xflags & FS_XFLAG_ATOMICWRITES;
> +
> /* Can't change RT or forcealign flags if any extents are allocated. */
> if (rtflag != XFS_IS_REALTIME_INODE(ip) ||
> forcealign != xfs_inode_has_forcealign(ip)) {
> @@ -554,6 +600,12 @@ xfs_ioctl_setattr_xflags(
> return error;
> }
>
> + if (atomic_writes) {
> + error = xfs_ioctl_setattr_atomicwrites(ip, fa);
> + if (error)
> + return error;
> + }
> +
> ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
> ip->i_diflags2 = i_flags2;
>
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 30228fea908d..0c5a3ae3cdaf 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -300,6 +300,7 @@ typedef struct xfs_mount {
> #define XFS_FEAT_NREXT64 (1ULL << 26) /* large extent counters */
> #define XFS_FEAT_EXCHANGE_RANGE (1ULL << 27) /* exchange range */
> #define XFS_FEAT_FORCEALIGN (1ULL << 28) /* aligned file data extents */
> +#define XFS_FEAT_ATOMICWRITES (1ULL << 29) /* atomic writes support */
>
> /* Mount features */
> #define XFS_FEAT_NOATTR2 (1ULL << 48) /* disable attr2 creation */
> @@ -387,6 +388,7 @@ __XFS_HAS_V4_FEAT(v3inodes, V3INODES)
> __XFS_HAS_V4_FEAT(crc, CRC)
> __XFS_HAS_V4_FEAT(pquotino, PQUOTINO)
> __XFS_HAS_FEAT(forcealign, FORCEALIGN)
> +__XFS_HAS_FEAT(atomicwrites, ATOMICWRITES)
>
> /*
> * Mount features
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index b52a01b50387..5352b90b2bb6 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1721,6 +1721,18 @@ xfs_fs_fill_super(
> mp->m_features &= ~XFS_FEAT_DISCARD;
> }
>
> + if (xfs_has_atomicwrites(mp)) {
> + if (!xfs_has_forcealign(mp)) {
> + xfs_alert(mp,
> + "forcealign required for atomicwrites!");
This (atomicwrites && !forcealign) ought to be checked in the superblock
verifier.
--D
> + error = -EINVAL;
> + goto out_filestream_unmount;
> + }
> +
> + xfs_warn(mp,
> +"EXPERIMENTAL atomicwrites feature in use. Use at your own risk!");
> + }
> +
> if (xfs_has_reflink(mp)) {
> if (mp->m_sb.sb_rblocks) {
> xfs_alert(mp,
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index f55d650f904a..c416f549e94d 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -160,6 +160,7 @@ struct fsxattr {
> #define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size allocator hint */
> /* data extent mappings for regular files must be aligned to extent size hint */
> #define FS_XFLAG_FORCEALIGN 0x00020000
> +#define FS_XFLAG_ATOMICWRITES 0x00040000 /* atomic writes enabled */
> #define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */
>
> /* the read-only stuff doesn't really belong here, but any other place is
> --
> 2.31.1
>
>
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH v5 4/7] xfs: Support FS_XFLAG_ATOMICWRITES for forcealign
2024-08-21 17:07 ` Darrick J. Wong
@ 2024-08-22 17:45 ` John Garry
2024-08-22 20:38 ` Darrick J. Wong
0 siblings, 1 reply; 28+ messages in thread
From: John Garry @ 2024-08-22 17:45 UTC (permalink / raw)
To: Darrick J. Wong
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
On 21/08/2024 18:07, Darrick J. Wong wrote:
> On Sat, Aug 17, 2024 at 09:47:57AM +0000, John Garry wrote:
>> Add initial support for new flag FS_XFLAG_ATOMICWRITES for forcealign
>> enabled.
>>
>> This flag is a file attribute that mirrors an ondisk inode flag. Actual
>> support for untorn file writes (for now) depends on both the iflag and the
>> underlying storage devices, which we can only really check at statx and
>> pwritev2() time. This is the same story as FS_XFLAG_DAX, which signals to
>> the fs that we should try to enable the fsdax IO path on the file (instead
>> of the regular page cache), but applications have to query STAT_ATTR_DAX to
>> find out if they really got that IO path.
>>
>> Current kernel support for atomic writes is based on HW support (for atomic
>> writes). As such, it is required to ensure extent alignment with
>> atomic_write_unit_max so that an atomic write can result in a single
>> HW-compliant IO operation.
>>
>> rtvol also guarantees extent alignment, but we are basing support initially
>> on forcealign, which is not supported for rtvol yet.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>> fs/xfs/libxfs/xfs_format.h | 11 +++++--
>> fs/xfs/libxfs/xfs_inode_buf.c | 52 ++++++++++++++++++++++++++++++++++
>> fs/xfs/libxfs/xfs_inode_util.c | 4 +++
>> fs/xfs/libxfs/xfs_sb.c | 2 ++
>> fs/xfs/xfs_buf.c | 15 +++++++++-
>> fs/xfs/xfs_buf.h | 4 ++-
>> fs/xfs/xfs_buf_mem.c | 2 +-
>> fs/xfs/xfs_inode.h | 5 ++++
>> fs/xfs/xfs_ioctl.c | 52 ++++++++++++++++++++++++++++++++++
>> fs/xfs/xfs_mount.h | 2 ++
>> fs/xfs/xfs_super.c | 12 ++++++++
>> include/uapi/linux/fs.h | 1 +
>> 12 files changed, 157 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
>> index 04c6cbc943c2..a9f3389438a6 100644
>> --- a/fs/xfs/libxfs/xfs_format.h
>> +++ b/fs/xfs/libxfs/xfs_format.h
>> @@ -353,12 +353,16 @@ xfs_sb_has_compat_feature(
>> #define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
>> #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
>> #define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30) /* aligned file data extents */
>> +#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 31) /* atomicwrites enabled */
>
> Do you ever see test failures in xfs/270?
Well it wasn't with forcealign only. I'll check again for atomic writes.
>
>> +
>> #define XFS_SB_FEAT_RO_COMPAT_ALL \
>> (XFS_SB_FEAT_RO_COMPAT_FINOBT | \
>> XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
>> XFS_SB_FEAT_RO_COMPAT_REFLINK| \
>> XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
>> - XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
>> + XFS_SB_FEAT_RO_COMPAT_FORCEALIGN | \
>> + XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
>> +
>> #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN ~XFS_SB_FEAT_RO_COMPAT_ALL
>> static inline bool
>> xfs_sb_has_ro_compat_feature(
>> @@ -1097,6 +1101,7 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
>> #define XFS_DIFLAG2_NREXT64_BIT 4 /* large extent counters */
>> /* data extent mappings for regular files must be aligned to extent size hint */
>> #define XFS_DIFLAG2_FORCEALIGN_BIT 5
>> +#define XFS_DIFLAG2_ATOMICWRITES_BIT 6
>>
>> #define XFS_DIFLAG2_DAX (1 << XFS_DIFLAG2_DAX_BIT)
>> #define XFS_DIFLAG2_REFLINK (1 << XFS_DIFLAG2_REFLINK_BIT)
>> @@ -1104,10 +1109,12 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
>> #define XFS_DIFLAG2_BIGTIME (1 << XFS_DIFLAG2_BIGTIME_BIT)
>> #define XFS_DIFLAG2_NREXT64 (1 << XFS_DIFLAG2_NREXT64_BIT)
>> #define XFS_DIFLAG2_FORCEALIGN (1 << XFS_DIFLAG2_FORCEALIGN_BIT)
>> +#define XFS_DIFLAG2_ATOMICWRITES (1 << XFS_DIFLAG2_ATOMICWRITES_BIT)
>>
>> #define XFS_DIFLAG2_ANY \
>> (XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
>> - XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN)
>> + XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN | \
>> + XFS_DIFLAG2_ATOMICWRITES)
>>
>> static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
>> {
>> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
>> index 1c59891fa9e2..59933c7df56d 100644
>> --- a/fs/xfs/libxfs/xfs_inode_buf.c
>> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
>> @@ -178,7 +178,10 @@ xfs_inode_from_disk(
>> struct xfs_inode *ip,
>> struct xfs_dinode *from)
>> {
>> + struct xfs_buftarg *target = xfs_inode_buftarg(ip);
>> struct inode *inode = VFS_I(ip);
>> + struct xfs_mount *mp = ip->i_mount;
>> + struct xfs_sb *sbp = &mp->m_sb;
>> int error;
>> xfs_failaddr_t fa;
>>
>> @@ -261,6 +264,13 @@ xfs_inode_from_disk(
>> }
>> if (xfs_is_reflink_inode(ip))
>> xfs_ifork_init_cow(ip);
>> +
>> + if (xfs_inode_has_atomicwrites(ip)) {
>> + if (sbp->sb_blocksize < target->bt_bdev_awu_min ||
>> + sbp->sb_blocksize * ip->i_extsize > target->bt_bdev_awu_max)
>
> Can this multiplication trigger integer overflows?
I should just use xfs_inode_alloc_unitsize()
>
>> + ip->i_diflags2 &= ~XFS_DIFLAG2_ATOMICWRITES;
>
> Ondisk iflag updates must use transactions.
I want to change this.
The idea was to runtime clear this flag in case the bdev cannot satisfy
the FS atomic write range, but that does not work.
>
> Or you can fail IOCB_ATOMIC writes if XFS_DIFLAG2_ATOMICWRITES is set
> but the forcealign blocksize doesn't fit with awu_min/max.
I'd rather just not set FMODE_CAN_ATOMIC_WRITE
>
>> + }
>> +
>> return 0;
>>
>> out_destroy_data_fork:
>> @@ -483,6 +493,40 @@ xfs_dinode_verify_nrext64(
>> return NULL;
>> }
>>
>> +static xfs_failaddr_t
>> +xfs_inode_validate_atomicwrites(
>> + struct xfs_mount *mp,
>> + uint32_t extsize,
>> + uint64_t flags2)
>> +{
>> + /* superblock rocompat feature flag */
>> + if (!xfs_has_atomicwrites(mp))
>> + return __this_address;
>> +
>> + /*
>> + * forcealign is required, so rely on sanity checks in
>> + * xfs_inode_validate_forcealign()
>> + */
>> + if (!(flags2 & XFS_DIFLAG2_FORCEALIGN))
>> + return __this_address;
>> +
>> + if (!is_power_of_2(extsize))
>> + return __this_address;
>> +
>> + /* Required to guarantee data block alignment */
>> + if (mp->m_sb.sb_agblocks % extsize)
>> + return __this_address;
>> +
>> + /* Requires stripe unit+width be a multiple of extsize */
>> + if (mp->m_dalign && (mp->m_dalign % extsize))
>> + return __this_address;
>> +
>> + if (mp->m_swidth && (mp->m_swidth % extsize))
>
> IIRC m_dalign and m_swidth can be set at mount time,
I thought that these were fixed at mkfs time, however I see that the
comment for xfs_update_alignment() mentions "values based on mount
options". And we are reading mp values, and not sbp values, which is a
good clue...
> which can result in
> inode verifiers logging corruption errors if those parameters change. I
> think we should validate these two congruencies when setting
> FMODE_CAN_ATOMIC_WRITE.
That would make sense.
>
>> + return __this_address;
>> +
>> + return NULL;
>> +}
>> +
>> xfs_failaddr_t
>> xfs_dinode_verify(
>> struct xfs_mount *mp,
>> @@ -666,6 +710,14 @@ xfs_dinode_verify(
>> return fa;
>> }
>>
>> + if (flags2 & XFS_DIFLAG2_ATOMICWRITES) {
>> + fa = xfs_inode_validate_atomicwrites(mp,
>> + be32_to_cpu(dip->di_extsize),
>> + flags2);
>> + if (fa)
>> + return fa;
>> + }
>> +
>> return NULL;
>> }
>>
>> diff --git a/fs/xfs/libxfs/xfs_inode_util.c b/fs/xfs/libxfs/xfs_inode_util.c
>> index b264939d8855..dbd5b16e1844 100644
>> --- a/fs/xfs/libxfs/xfs_inode_util.c
>> +++ b/fs/xfs/libxfs/xfs_inode_util.c
>> @@ -82,6 +82,8 @@ xfs_flags2diflags2(
>> di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
>> if (xflags & FS_XFLAG_FORCEALIGN)
>> di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
>> + if (xflags & FS_XFLAG_ATOMICWRITES)
>> + di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;
>>
>> return di_flags2;
>> }
>> @@ -130,6 +132,8 @@ xfs_ip2xflags(
>> flags |= FS_XFLAG_COWEXTSIZE;
>> if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
>> flags |= FS_XFLAG_FORCEALIGN;
>> + if (ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
>> + flags |= FS_XFLAG_ATOMICWRITES;
>> }
>>
>> if (xfs_inode_has_attr_fork(ip))
>> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
>> index e56911553edd..5de8725bf93a 100644
>> --- a/fs/xfs/libxfs/xfs_sb.c
>> +++ b/fs/xfs/libxfs/xfs_sb.c
>> @@ -166,6 +166,8 @@ xfs_sb_version_to_features(
>> features |= XFS_FEAT_INOBTCNT;
>> if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
>> features |= XFS_FEAT_FORCEALIGN;
>> + if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
>> + features |= XFS_FEAT_ATOMICWRITES;
>> if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
>> features |= XFS_FEAT_FTYPE;
>> if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
>> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
>> index aa4dbda7b536..44bee3e2b2bb 100644
>> --- a/fs/xfs/xfs_buf.c
>> +++ b/fs/xfs/xfs_buf.c
>> @@ -2060,6 +2060,8 @@ int
>> xfs_init_buftarg(
>> struct xfs_buftarg *btp,
>> size_t logical_sectorsize,
>> + unsigned int awu_min,
>> + unsigned int awu_max,
>> const char *descr)
>> {
>> /* Set up device logical sector size mask */
>> @@ -2086,6 +2088,9 @@ xfs_init_buftarg(
>> btp->bt_shrinker->scan_objects = xfs_buftarg_shrink_scan;
>> btp->bt_shrinker->private_data = btp;
>> shrinker_register(btp->bt_shrinker);
>> +
>> + btp->bt_bdev_awu_min = awu_min;
>> + btp->bt_bdev_awu_max = awu_max;
>> return 0;
>>
>> out_destroy_io_count:
>> @@ -2102,6 +2107,7 @@ xfs_alloc_buftarg(
>> {
>> struct xfs_buftarg *btp;
>> const struct dax_holder_operations *ops = NULL;
>> + unsigned int awu_min = 0, awu_max = 0;
>>
>> #if defined(CONFIG_FS_DAX) && defined(CONFIG_MEMORY_FAILURE)
>> ops = &xfs_dax_holder_operations;
>> @@ -2115,6 +2121,13 @@ xfs_alloc_buftarg(
>> btp->bt_daxdev = fs_dax_get_by_bdev(btp->bt_bdev, &btp->bt_dax_part_off,
>> mp, ops);
>>
>> + if (bdev_can_atomic_write(btp->bt_bdev)) {
>> + struct request_queue *q = bdev_get_queue(btp->bt_bdev);
>> +
>> + awu_min = queue_atomic_write_unit_min_bytes(q);
>> + awu_max = queue_atomic_write_unit_max_bytes(q);
>> + }
>> +
>> /*
>> * When allocating the buftargs we have not yet read the super block and
>> * thus don't know the file system sector size yet.
>> @@ -2122,7 +2135,7 @@ xfs_alloc_buftarg(
>> if (xfs_setsize_buftarg(btp, bdev_logical_block_size(btp->bt_bdev)))
>> goto error_free;
>> if (xfs_init_buftarg(btp, bdev_logical_block_size(btp->bt_bdev),
>> - mp->m_super->s_id))
>> + awu_min, awu_max, mp->m_super->s_id))
>> goto error_free;
>>
>> return btp;
>> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
>> index b1580644501f..3bcd8137d739 100644
>> --- a/fs/xfs/xfs_buf.h
>> +++ b/fs/xfs/xfs_buf.h
>> @@ -124,6 +124,8 @@ struct xfs_buftarg {
>> struct percpu_counter bt_io_count;
>> struct ratelimit_state bt_ioerror_rl;
>>
>> + unsigned int bt_bdev_awu_min, bt_bdev_awu_max;
>
> Please add a comment here about what these mean. Not everyone is going
> to know what "awu" abbreviates.
sure
>> +
>> static int
>> xfs_ioctl_setattr_xflags(
>> struct xfs_trans *tp,
>> @@ -511,9 +554,12 @@ xfs_ioctl_setattr_xflags(
>> struct xfs_mount *mp = ip->i_mount;
>> bool rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
>> bool forcealign = fa->fsx_xflags & FS_XFLAG_FORCEALIGN;
>> + bool atomic_writes;
>> uint64_t i_flags2;
>> int error;
>>
>> + atomic_writes = fa->fsx_xflags & FS_XFLAG_ATOMICWRITES;
>> +
>> /* Can't change RT or forcealign flags if any extents are allocated. */
>> if (rtflag != XFS_IS_REALTIME_INODE(ip) ||
>> forcealign != xfs_inode_has_forcealign(ip)) {
>> @@ -554,6 +600,12 @@ xfs_ioctl_setattr_xflags(
>> return error;
>> }
>>
>> + if (atomic_writes) {
>> + error = xfs_ioctl_setattr_atomicwrites(ip, fa);
>> + if (error)
>> + return error;
>> + }
>> +
>> ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
>> ip->i_diflags2 = i_flags2;
>>
>> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
>> index 30228fea908d..0c5a3ae3cdaf 100644
>> --- a/fs/xfs/xfs_mount.h
>> +++ b/fs/xfs/xfs_mount.h
>> @@ -300,6 +300,7 @@ typedef struct xfs_mount {
>> #define XFS_FEAT_NREXT64 (1ULL << 26) /* large extent counters */
>> #define XFS_FEAT_EXCHANGE_RANGE (1ULL << 27) /* exchange range */
>> #define XFS_FEAT_FORCEALIGN (1ULL << 28) /* aligned file data extents */
>> +#define XFS_FEAT_ATOMICWRITES (1ULL << 29) /* atomic writes support */
>>
>> /* Mount features */
>> #define XFS_FEAT_NOATTR2 (1ULL << 48) /* disable attr2 creation */
>> @@ -387,6 +388,7 @@ __XFS_HAS_V4_FEAT(v3inodes, V3INODES)
>> __XFS_HAS_V4_FEAT(crc, CRC)
>> __XFS_HAS_V4_FEAT(pquotino, PQUOTINO)
>> __XFS_HAS_FEAT(forcealign, FORCEALIGN)
>> +__XFS_HAS_FEAT(atomicwrites, ATOMICWRITES)
>>
>> /*
>> * Mount features
>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>> index b52a01b50387..5352b90b2bb6 100644
>> --- a/fs/xfs/xfs_super.c
>> +++ b/fs/xfs/xfs_super.c
>> @@ -1721,6 +1721,18 @@ xfs_fs_fill_super(
>> mp->m_features &= ~XFS_FEAT_DISCARD;
>> }
>>
>> + if (xfs_has_atomicwrites(mp)) {
>> + if (!xfs_has_forcealign(mp)) {
>> + xfs_alert(mp,
>> + "forcealign required for atomicwrites!");
>
> This (atomicwrites && !forcealign) ought to be checked in the superblock
> verifier.
You mean in xfs_fs_validate_params(), right?
Thanks,
John
>
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH v5 4/7] xfs: Support FS_XFLAG_ATOMICWRITES for forcealign
2024-08-22 17:45 ` John Garry
@ 2024-08-22 20:38 ` Darrick J. Wong
2024-08-23 8:39 ` John Garry
0 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2024-08-22 20:38 UTC (permalink / raw)
To: John Garry
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
On Thu, Aug 22, 2024 at 06:45:16PM +0100, John Garry wrote:
> On 21/08/2024 18:07, Darrick J. Wong wrote:
> > On Sat, Aug 17, 2024 at 09:47:57AM +0000, John Garry wrote:
> > > Add initial support for new flag FS_XFLAG_ATOMICWRITES for forcealign
> > > enabled.
> > >
> > > This flag is a file attribute that mirrors an ondisk inode flag. Actual
> > > support for untorn file writes (for now) depends on both the iflag and the
> > > underlying storage devices, which we can only really check at statx and
> > > pwritev2() time. This is the same story as FS_XFLAG_DAX, which signals to
> > > the fs that we should try to enable the fsdax IO path on the file (instead
> > > of the regular page cache), but applications have to query STAT_ATTR_DAX to
> > > find out if they really got that IO path.
> > >
> > > Current kernel support for atomic writes is based on HW support (for atomic
> > > writes). As such, it is required to ensure extent alignment with
> > > atomic_write_unit_max so that an atomic write can result in a single
> > > HW-compliant IO operation.
> > >
> > > rtvol also guarantees extent alignment, but we are basing support initially
> > > on forcealign, which is not supported for rtvol yet.
> > >
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > > fs/xfs/libxfs/xfs_format.h | 11 +++++--
> > > fs/xfs/libxfs/xfs_inode_buf.c | 52 ++++++++++++++++++++++++++++++++++
> > > fs/xfs/libxfs/xfs_inode_util.c | 4 +++
> > > fs/xfs/libxfs/xfs_sb.c | 2 ++
> > > fs/xfs/xfs_buf.c | 15 +++++++++-
> > > fs/xfs/xfs_buf.h | 4 ++-
> > > fs/xfs/xfs_buf_mem.c | 2 +-
> > > fs/xfs/xfs_inode.h | 5 ++++
> > > fs/xfs/xfs_ioctl.c | 52 ++++++++++++++++++++++++++++++++++
> > > fs/xfs/xfs_mount.h | 2 ++
> > > fs/xfs/xfs_super.c | 12 ++++++++
> > > include/uapi/linux/fs.h | 1 +
> > > 12 files changed, 157 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > index 04c6cbc943c2..a9f3389438a6 100644
> > > --- a/fs/xfs/libxfs/xfs_format.h
> > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > @@ -353,12 +353,16 @@ xfs_sb_has_compat_feature(
> > > #define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
> > > #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
> > > #define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30) /* aligned file data extents */
> > > +#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 31) /* atomicwrites enabled */
> >
> > Do you ever see test failures in xfs/270?
>
> Well it wasn't with forcealign only. I'll check again for atomic writes.
>
> >
> > > +
> > > #define XFS_SB_FEAT_RO_COMPAT_ALL \
> > > (XFS_SB_FEAT_RO_COMPAT_FINOBT | \
> > > XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
> > > XFS_SB_FEAT_RO_COMPAT_REFLINK| \
> > > XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
> > > - XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
> > > + XFS_SB_FEAT_RO_COMPAT_FORCEALIGN | \
> > > + XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
> > > +
> > > #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN ~XFS_SB_FEAT_RO_COMPAT_ALL
> > > static inline bool
> > > xfs_sb_has_ro_compat_feature(
> > > @@ -1097,6 +1101,7 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
> > > #define XFS_DIFLAG2_NREXT64_BIT 4 /* large extent counters */
> > > /* data extent mappings for regular files must be aligned to extent size hint */
> > > #define XFS_DIFLAG2_FORCEALIGN_BIT 5
> > > +#define XFS_DIFLAG2_ATOMICWRITES_BIT 6
> > > #define XFS_DIFLAG2_DAX (1 << XFS_DIFLAG2_DAX_BIT)
> > > #define XFS_DIFLAG2_REFLINK (1 << XFS_DIFLAG2_REFLINK_BIT)
> > > @@ -1104,10 +1109,12 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
> > > #define XFS_DIFLAG2_BIGTIME (1 << XFS_DIFLAG2_BIGTIME_BIT)
> > > #define XFS_DIFLAG2_NREXT64 (1 << XFS_DIFLAG2_NREXT64_BIT)
> > > #define XFS_DIFLAG2_FORCEALIGN (1 << XFS_DIFLAG2_FORCEALIGN_BIT)
> > > +#define XFS_DIFLAG2_ATOMICWRITES (1 << XFS_DIFLAG2_ATOMICWRITES_BIT)
> > > #define XFS_DIFLAG2_ANY \
> > > (XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
> > > - XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN)
> > > + XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN | \
> > > + XFS_DIFLAG2_ATOMICWRITES)
> > > static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
> > > {
> > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > index 1c59891fa9e2..59933c7df56d 100644
> > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > @@ -178,7 +178,10 @@ xfs_inode_from_disk(
> > > struct xfs_inode *ip,
> > > struct xfs_dinode *from)
> > > {
> > > + struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> > > struct inode *inode = VFS_I(ip);
> > > + struct xfs_mount *mp = ip->i_mount;
> > > + struct xfs_sb *sbp = &mp->m_sb;
> > > int error;
> > > xfs_failaddr_t fa;
> > > @@ -261,6 +264,13 @@ xfs_inode_from_disk(
> > > }
> > > if (xfs_is_reflink_inode(ip))
> > > xfs_ifork_init_cow(ip);
> > > +
> > > + if (xfs_inode_has_atomicwrites(ip)) {
> > > + if (sbp->sb_blocksize < target->bt_bdev_awu_min ||
> > > + sbp->sb_blocksize * ip->i_extsize > target->bt_bdev_awu_max)
> >
> > Can this multiplication trigger integer overflows?
>
> I should just use xfs_inode_alloc_unitsize()
>
> >
> > > + ip->i_diflags2 &= ~XFS_DIFLAG2_ATOMICWRITES;
> >
> > Ondisk iflag updates must use transactions.
>
> I want to change this.
>
> The idea was to runtime clear this flag in case the bdev cannot satisfy the
> FS atomic write range, but that does not work.
Yeah, the ondisk and incore state have to be separate when we're playing
with hardware features because the fs image can be created on a device
that supports $feature and then moved to a device that does not. For
that case you don't enable the incore state. Though I suppose for
untorn writes you /could/ just let things EIO...
> > Or you can fail IOCB_ATOMIC writes if XFS_DIFLAG2_ATOMICWRITES is set
> > but the forcealign blocksize doesn't fit with awu_min/max.
>
> I'd rather just not set FMODE_CAN_ATOMIC_WRITE
...but I'll move that discussion to that email.
> > > + }
> > > +
> > > return 0;
> > > out_destroy_data_fork:
> > > @@ -483,6 +493,40 @@ xfs_dinode_verify_nrext64(
> > > return NULL;
> > > }
> > > +static xfs_failaddr_t
> > > +xfs_inode_validate_atomicwrites(
> > > + struct xfs_mount *mp,
> > > + uint32_t extsize,
> > > + uint64_t flags2)
> > > +{
> > > + /* superblock rocompat feature flag */
> > > + if (!xfs_has_atomicwrites(mp))
> > > + return __this_address;
> > > +
> > > + /*
> > > + * forcealign is required, so rely on sanity checks in
> > > + * xfs_inode_validate_forcealign()
> > > + */
> > > + if (!(flags2 & XFS_DIFLAG2_FORCEALIGN))
> > > + return __this_address;
> > > +
> > > + if (!is_power_of_2(extsize))
> > > + return __this_address;
> > > +
> > > + /* Required to guarantee data block alignment */
> > > + if (mp->m_sb.sb_agblocks % extsize)
> > > + return __this_address;
> > > +
> > > + /* Requires stripe unit+width be a multiple of extsize */
> > > + if (mp->m_dalign && (mp->m_dalign % extsize))
> > > + return __this_address;
> > > +
> > > + if (mp->m_swidth && (mp->m_swidth % extsize))
> >
> > IIRC m_dalign and m_swidth can be set at mount time,
>
> I thought that these were fixed at mkfs time, however I see that the comment
> for xfs_update_alignment() mentions "values based on mount options". And we
> are reading mp values, and not sbp values, which is a good clue...
Yes.
> > which can result in
> > inode verifiers logging corruption errors if those parameters change. I
> > think we should validate these two congruencies when setting
> > FMODE_CAN_ATOMIC_WRITE.
>
> That would make sense.
>
> >
> > > + return __this_address;
> > > +
> > > + return NULL;
> > > +}
> > > +
> > > xfs_failaddr_t
> > > xfs_dinode_verify(
> > > struct xfs_mount *mp,
> > > @@ -666,6 +710,14 @@ xfs_dinode_verify(
> > > return fa;
> > > }
> > > + if (flags2 & XFS_DIFLAG2_ATOMICWRITES) {
> > > + fa = xfs_inode_validate_atomicwrites(mp,
> > > + be32_to_cpu(dip->di_extsize),
> > > + flags2);
> > > + if (fa)
> > > + return fa;
> > > + }
> > > +
> > > return NULL;
> > > }
> > > diff --git a/fs/xfs/libxfs/xfs_inode_util.c b/fs/xfs/libxfs/xfs_inode_util.c
> > > index b264939d8855..dbd5b16e1844 100644
> > > --- a/fs/xfs/libxfs/xfs_inode_util.c
> > > +++ b/fs/xfs/libxfs/xfs_inode_util.c
> > > @@ -82,6 +82,8 @@ xfs_flags2diflags2(
> > > di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
> > > if (xflags & FS_XFLAG_FORCEALIGN)
> > > di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
> > > + if (xflags & FS_XFLAG_ATOMICWRITES)
> > > + di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;
> > > return di_flags2;
> > > }
> > > @@ -130,6 +132,8 @@ xfs_ip2xflags(
> > > flags |= FS_XFLAG_COWEXTSIZE;
> > > if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
> > > flags |= FS_XFLAG_FORCEALIGN;
> > > + if (ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
> > > + flags |= FS_XFLAG_ATOMICWRITES;
> > > }
> > > if (xfs_inode_has_attr_fork(ip))
> > > diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> > > index e56911553edd..5de8725bf93a 100644
> > > --- a/fs/xfs/libxfs/xfs_sb.c
> > > +++ b/fs/xfs/libxfs/xfs_sb.c
> > > @@ -166,6 +166,8 @@ xfs_sb_version_to_features(
> > > features |= XFS_FEAT_INOBTCNT;
> > > if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
> > > features |= XFS_FEAT_FORCEALIGN;
> > > + if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
> > > + features |= XFS_FEAT_ATOMICWRITES;
> > > if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
> > > features |= XFS_FEAT_FTYPE;
> > > if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index aa4dbda7b536..44bee3e2b2bb 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -2060,6 +2060,8 @@ int
> > > xfs_init_buftarg(
> > > struct xfs_buftarg *btp,
> > > size_t logical_sectorsize,
> > > + unsigned int awu_min,
> > > + unsigned int awu_max,
> > > const char *descr)
> > > {
> > > /* Set up device logical sector size mask */
> > > @@ -2086,6 +2088,9 @@ xfs_init_buftarg(
> > > btp->bt_shrinker->scan_objects = xfs_buftarg_shrink_scan;
> > > btp->bt_shrinker->private_data = btp;
> > > shrinker_register(btp->bt_shrinker);
> > > +
> > > + btp->bt_bdev_awu_min = awu_min;
> > > + btp->bt_bdev_awu_max = awu_max;
> > > return 0;
> > > out_destroy_io_count:
> > > @@ -2102,6 +2107,7 @@ xfs_alloc_buftarg(
> > > {
> > > struct xfs_buftarg *btp;
> > > const struct dax_holder_operations *ops = NULL;
> > > + unsigned int awu_min = 0, awu_max = 0;
> > > #if defined(CONFIG_FS_DAX) && defined(CONFIG_MEMORY_FAILURE)
> > > ops = &xfs_dax_holder_operations;
> > > @@ -2115,6 +2121,13 @@ xfs_alloc_buftarg(
> > > btp->bt_daxdev = fs_dax_get_by_bdev(btp->bt_bdev, &btp->bt_dax_part_off,
> > > mp, ops);
> > > + if (bdev_can_atomic_write(btp->bt_bdev)) {
> > > + struct request_queue *q = bdev_get_queue(btp->bt_bdev);
> > > +
> > > + awu_min = queue_atomic_write_unit_min_bytes(q);
> > > + awu_max = queue_atomic_write_unit_max_bytes(q);
> > > + }
> > > +
> > > /*
> > > * When allocating the buftargs we have not yet read the super block and
> > > * thus don't know the file system sector size yet.
> > > @@ -2122,7 +2135,7 @@ xfs_alloc_buftarg(
> > > if (xfs_setsize_buftarg(btp, bdev_logical_block_size(btp->bt_bdev)))
> > > goto error_free;
> > > if (xfs_init_buftarg(btp, bdev_logical_block_size(btp->bt_bdev),
> > > - mp->m_super->s_id))
> > > + awu_min, awu_max, mp->m_super->s_id))
> > > goto error_free;
> > > return btp;
> > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > > index b1580644501f..3bcd8137d739 100644
> > > --- a/fs/xfs/xfs_buf.h
> > > +++ b/fs/xfs/xfs_buf.h
> > > @@ -124,6 +124,8 @@ struct xfs_buftarg {
> > > struct percpu_counter bt_io_count;
> > > struct ratelimit_state bt_ioerror_rl;
> > > + unsigned int bt_bdev_awu_min, bt_bdev_awu_max;
> >
> > Please add a comment here about what these mean. Not everyone is going
> > to know what "awu" abbreviates.
>
> sure
>
> > > +
> > > static int
> > > xfs_ioctl_setattr_xflags(
> > > struct xfs_trans *tp,
> > > @@ -511,9 +554,12 @@ xfs_ioctl_setattr_xflags(
> > > struct xfs_mount *mp = ip->i_mount;
> > > bool rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
> > > bool forcealign = fa->fsx_xflags & FS_XFLAG_FORCEALIGN;
> > > + bool atomic_writes;
> > > uint64_t i_flags2;
> > > int error;
> > > + atomic_writes = fa->fsx_xflags & FS_XFLAG_ATOMICWRITES;
> > > +
> > > /* Can't change RT or forcealign flags if any extents are allocated. */
> > > if (rtflag != XFS_IS_REALTIME_INODE(ip) ||
> > > forcealign != xfs_inode_has_forcealign(ip)) {
> > > @@ -554,6 +600,12 @@ xfs_ioctl_setattr_xflags(
> > > return error;
> > > }
> > > + if (atomic_writes) {
> > > + error = xfs_ioctl_setattr_atomicwrites(ip, fa);
> > > + if (error)
> > > + return error;
> > > + }
> > > +
> > > ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
> > > ip->i_diflags2 = i_flags2;
> > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > > index 30228fea908d..0c5a3ae3cdaf 100644
> > > --- a/fs/xfs/xfs_mount.h
> > > +++ b/fs/xfs/xfs_mount.h
> > > @@ -300,6 +300,7 @@ typedef struct xfs_mount {
> > > #define XFS_FEAT_NREXT64 (1ULL << 26) /* large extent counters */
> > > #define XFS_FEAT_EXCHANGE_RANGE (1ULL << 27) /* exchange range */
> > > #define XFS_FEAT_FORCEALIGN (1ULL << 28) /* aligned file data extents */
> > > +#define XFS_FEAT_ATOMICWRITES (1ULL << 29) /* atomic writes support */
> > > /* Mount features */
> > > #define XFS_FEAT_NOATTR2 (1ULL << 48) /* disable attr2 creation */
> > > @@ -387,6 +388,7 @@ __XFS_HAS_V4_FEAT(v3inodes, V3INODES)
> > > __XFS_HAS_V4_FEAT(crc, CRC)
> > > __XFS_HAS_V4_FEAT(pquotino, PQUOTINO)
> > > __XFS_HAS_FEAT(forcealign, FORCEALIGN)
> > > +__XFS_HAS_FEAT(atomicwrites, ATOMICWRITES)
> > > /*
> > > * Mount features
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index b52a01b50387..5352b90b2bb6 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -1721,6 +1721,18 @@ xfs_fs_fill_super(
> > > mp->m_features &= ~XFS_FEAT_DISCARD;
> > > }
> > > + if (xfs_has_atomicwrites(mp)) {
> > > + if (!xfs_has_forcealign(mp)) {
> > > + xfs_alert(mp,
> > > + "forcealign required for atomicwrites!");
> >
> > This (atomicwrites && !forcealign) ought to be checked in the superblock
> > verifier.
>
> You mean in xfs_fs_validate_params(), right?
xfs_validate_sb_common, where we do all the other ondisk superblock
validation.
--D
>
> Thanks,
> John
>
> >
>
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH v5 4/7] xfs: Support FS_XFLAG_ATOMICWRITES for forcealign
2024-08-22 20:38 ` Darrick J. Wong
@ 2024-08-23 8:39 ` John Garry
2024-08-23 16:03 ` Darrick J. Wong
0 siblings, 1 reply; 28+ messages in thread
From: John Garry @ 2024-08-23 8:39 UTC (permalink / raw)
To: Darrick J. Wong
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
On 22/08/2024 21:38, Darrick J. Wong wrote:
>>> This (atomicwrites && !forcealign) ought to be checked in the superblock
>>> verifier.
>> You mean in xfs_fs_validate_params(), right?
> xfs_validate_sb_common, where we do all the other ondisk superblock
> validation.
I don't see any other xfs_has_XXX checks in xfs_validate_sb_common(),
but this could be the first...
The only other place in which I see a pattern of similar SB feature flag
checks is in xfs_finish_flags() for checking xfs_has_crc() &&
xfs_has_noattr2().
So if we go with xfs_validate_sb_common(), then should the check in
xfs_fs_fill_super() for xfs_has_forcealign() &&
xfs_has_realtime()/reflink() be relocated to xfs_validate_sb_common() also:
https://lore.kernel.org/linux-xfs/20240813163638.3751939-8-john.g.garry@oracle.com/
Cheers,
John
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v5 4/7] xfs: Support FS_XFLAG_ATOMICWRITES for forcealign
2024-08-23 8:39 ` John Garry
@ 2024-08-23 16:03 ` Darrick J. Wong
0 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2024-08-23 16:03 UTC (permalink / raw)
To: John Garry
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
On Fri, Aug 23, 2024 at 09:39:44AM +0100, John Garry wrote:
> On 22/08/2024 21:38, Darrick J. Wong wrote:
> > > > This (atomicwrites && !forcealign) ought to be checked in the superblock
> > > > verifier.
> > > You mean in xfs_fs_validate_params(), right?
> > xfs_validate_sb_common, where we do all the other ondisk superblock
> > validation.
>
> I don't see any other xfs_has_XXX checks in xfs_validate_sb_common(), but
> this could be the first...
The superblock verifier runs at a lower level in the filesystem -- it
checks that the ondisk superblock doesn't contain any inconsistent
fields or impossible feature combinations, etc. Once the ondisk
superblock is verified, the information there is used to set XFS_FEAT_*
bits in m_features, which is what the xfs_has_* predicates access.
Therefore, you have to look at the raw superblock fields, not the
xfs_has_ predicates:
if ((sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES) &&
!(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)) {
xfs_warn(mp, "atomic writes feature requires force align feature.");
return -EINVAL;
}
The reason for checking this state here is that atomicwrites absolutely
requires forcealign and that dependency will always be true.
> The only other place in which I see a pattern of similar SB feature flag
> checks is in xfs_finish_flags() for checking xfs_has_crc() &&
> xfs_has_noattr2().
>
> So if we go with xfs_validate_sb_common(), then should the check in
> xfs_fs_fill_super() for xfs_has_forcealign() && xfs_has_realtime()/reflink()
> be relocated to xfs_validate_sb_common() also:
No. Contrast the above with (forcealign && !realtime), which at least
in theory is temporary, so that should live in xfs_fs_fill_super. Or
put another way, xfs_fs_fill_super is where we screen out the kernel
being too stupid to support something it found on disk.
--D
>
> https://lore.kernel.org/linux-xfs/20240813163638.3751939-8-john.g.garry@oracle.com/
>
> Cheers,
> John
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v5 5/7] xfs: Support atomic write for statx
2024-08-17 9:47 [PATCH v5 0/7] block atomic writes for xfs John Garry
` (3 preceding siblings ...)
2024-08-17 9:47 ` [PATCH v5 4/7] xfs: Support FS_XFLAG_ATOMICWRITES for forcealign John Garry
@ 2024-08-17 9:47 ` John Garry
2024-08-21 17:09 ` Darrick J. Wong
2024-08-17 9:47 ` [PATCH v5 6/7] xfs: Validate atomic writes John Garry
2024-08-17 9:48 ` [PATCH v5 7/7] xfs: Support setting FMODE_CAN_ATOMIC_WRITE John Garry
6 siblings, 1 reply; 28+ messages in thread
From: John Garry @ 2024-08-17 9:47 UTC (permalink / raw)
To: axboe, brauner, djwong, viro, jack, chandan.babu, dchinner, hch
Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch, John Garry
Support providing info on atomic write unit min and max for an inode.
For simplicity, currently we limit the min at the FS block size, but a
lower limit could be supported in future. This is required by iomap
DIO.
The atomic write unit min and max is limited by the guaranteed extent
alignment for the inode.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/xfs_iops.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 6e017aa6f61d..c20becd3a7c9 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -570,6 +570,27 @@ xfs_stat_blksize(
return PAGE_SIZE;
}
+static void
+xfs_get_atomic_write_attr(
+ struct xfs_inode *ip,
+ unsigned int *unit_min,
+ unsigned int *unit_max)
+{
+ struct xfs_buftarg *target = xfs_inode_buftarg(ip);
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_sb *sbp = &mp->m_sb;
+ unsigned int extsz_bytes = XFS_FSB_TO_B(mp, ip->i_extsize);
+
+ if (!xfs_inode_has_atomicwrites(ip)) {
+ *unit_min = 0;
+ *unit_max = 0;
+ return;
+ }
+
+ *unit_min = sbp->sb_blocksize;
+ *unit_max = min(target->bt_bdev_awu_max, extsz_bytes);
+}
+
STATIC int
xfs_vn_getattr(
struct mnt_idmap *idmap,
@@ -643,6 +664,13 @@ xfs_vn_getattr(
stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
stat->dio_offset_align = bdev_logical_block_size(bdev);
}
+ if (request_mask & STATX_WRITE_ATOMIC) {
+ unsigned int unit_min, unit_max;
+
+ xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
+ generic_fill_statx_atomic_writes(stat,
+ unit_min, unit_max);
+ }
fallthrough;
default:
stat->blksize = xfs_stat_blksize(ip);
--
2.31.1
^ permalink raw reply related [flat|nested] 28+ messages in thread* Re: [PATCH v5 5/7] xfs: Support atomic write for statx
2024-08-17 9:47 ` [PATCH v5 5/7] xfs: Support atomic write for statx John Garry
@ 2024-08-21 17:09 ` Darrick J. Wong
0 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2024-08-21 17:09 UTC (permalink / raw)
To: John Garry
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
On Sat, Aug 17, 2024 at 09:47:58AM +0000, John Garry wrote:
> Support providing info on atomic write unit min and max for an inode.
>
> For simplicity, currently we limit the min at the FS block size, but a
> lower limit could be supported in future. This is required by iomap
> DIO.
>
> The atomic write unit min and max is limited by the guaranteed extent
> alignment for the inode.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> fs/xfs/xfs_iops.c | 28 ++++++++++++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 6e017aa6f61d..c20becd3a7c9 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -570,6 +570,27 @@ xfs_stat_blksize(
> return PAGE_SIZE;
> }
>
> +static void
> +xfs_get_atomic_write_attr(
> + struct xfs_inode *ip,
> + unsigned int *unit_min,
> + unsigned int *unit_max)
> +{
> + struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> + struct xfs_mount *mp = ip->i_mount;
> + struct xfs_sb *sbp = &mp->m_sb;
> + unsigned int extsz_bytes = XFS_FSB_TO_B(mp, ip->i_extsize);
> +
> + if (!xfs_inode_has_atomicwrites(ip)) {
Here's where you might want to check that DIFLAG2_ATOMICWRITES is set
and stripe alignment still makes sense before handing out nonzero
STATX_WRITE_ATOMIC information. The rest of the patch looks ok to me.
--D
> + *unit_min = 0;
> + *unit_max = 0;
> + return;
> + }
> +
> + *unit_min = sbp->sb_blocksize;
> + *unit_max = min(target->bt_bdev_awu_max, extsz_bytes);
> +}
> +
> STATIC int
> xfs_vn_getattr(
> struct mnt_idmap *idmap,
> @@ -643,6 +664,13 @@ xfs_vn_getattr(
> stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
> stat->dio_offset_align = bdev_logical_block_size(bdev);
> }
> + if (request_mask & STATX_WRITE_ATOMIC) {
> + unsigned int unit_min, unit_max;
> +
> + xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
> + generic_fill_statx_atomic_writes(stat,
> + unit_min, unit_max);
> + }
> fallthrough;
> default:
> stat->blksize = xfs_stat_blksize(ip);
> --
> 2.31.1
>
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v5 6/7] xfs: Validate atomic writes
2024-08-17 9:47 [PATCH v5 0/7] block atomic writes for xfs John Garry
` (4 preceding siblings ...)
2024-08-17 9:47 ` [PATCH v5 5/7] xfs: Support atomic write for statx John Garry
@ 2024-08-17 9:47 ` John Garry
2024-08-21 17:10 ` Darrick J. Wong
2024-08-17 9:48 ` [PATCH v5 7/7] xfs: Support setting FMODE_CAN_ATOMIC_WRITE John Garry
6 siblings, 1 reply; 28+ messages in thread
From: John Garry @ 2024-08-17 9:47 UTC (permalink / raw)
To: axboe, brauner, djwong, viro, jack, chandan.babu, dchinner, hch
Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch, John Garry
Validate that an atomic write adheres to length/offset rules. Since we
require extent alignment for atomic writes, this effectively also enforces
that the BIO which iomap produces is aligned.
For an IOCB with IOCB_ATOMIC set to get as far as xfs_file_dio_write(),
FMODE_CAN_ATOMIC_WRITE will need to be set for the file; for this,
FORCEALIGN and also ATOMICWRITES flags would also need to be set for the
inode.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/xfs_file.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 4cdc54dc9686..9b6530a4eb4a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -684,9 +684,20 @@ xfs_file_dio_write(
struct kiocb *iocb,
struct iov_iter *from)
{
- struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct xfs_inode *ip = XFS_I(inode);
struct xfs_buftarg *target = xfs_inode_buftarg(ip);
size_t count = iov_iter_count(from);
+ struct xfs_mount *mp = ip->i_mount;
+
+ if (iocb->ki_flags & IOCB_ATOMIC) {
+ if (count < i_blocksize(inode))
+ return -EINVAL;
+ if (count > XFS_FSB_TO_B(mp, ip->i_extsize))
+ return -EINVAL;
+ if (!generic_atomic_write_valid(iocb, from))
+ return -EINVAL;
+ }
/* direct I/O must be aligned to device logical sector size */
if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
--
2.31.1
^ permalink raw reply related [flat|nested] 28+ messages in thread* Re: [PATCH v5 6/7] xfs: Validate atomic writes
2024-08-17 9:47 ` [PATCH v5 6/7] xfs: Validate atomic writes John Garry
@ 2024-08-21 17:10 ` Darrick J. Wong
0 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2024-08-21 17:10 UTC (permalink / raw)
To: John Garry
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
On Sat, Aug 17, 2024 at 09:47:59AM +0000, John Garry wrote:
> Validate that an atomic write adheres to length/offset rules. Since we
> require extent alignment for atomic writes, this effectively also enforces
> that the BIO which iomap produces is aligned.
>
> For an IOCB with IOCB_ATOMIC set to get as far as xfs_file_dio_write(),
> FMODE_CAN_ATOMIC_WRITE will need to be set for the file; for this,
> FORCEALIGN and also ATOMICWRITES flags would also need to be set for the
> inode.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> fs/xfs/xfs_file.c | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 4cdc54dc9686..9b6530a4eb4a 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -684,9 +684,20 @@ xfs_file_dio_write(
> struct kiocb *iocb,
> struct iov_iter *from)
> {
> - struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
> + struct inode *inode = file_inode(iocb->ki_filp);
> + struct xfs_inode *ip = XFS_I(inode);
> struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> size_t count = iov_iter_count(from);
> + struct xfs_mount *mp = ip->i_mount;
> +
> + if (iocb->ki_flags & IOCB_ATOMIC) {
> + if (count < i_blocksize(inode))
> + return -EINVAL;
> + if (count > XFS_FSB_TO_B(mp, ip->i_extsize))
> + return -EINVAL;
Here's also the place to check the dynamic things like dalign/swidth
alignment. Other than that, this looks good.
--D
> + if (!generic_atomic_write_valid(iocb, from))
> + return -EINVAL;
> + }
>
> /* direct I/O must be aligned to device logical sector size */
> if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
> --
> 2.31.1
>
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v5 7/7] xfs: Support setting FMODE_CAN_ATOMIC_WRITE
2024-08-17 9:47 [PATCH v5 0/7] block atomic writes for xfs John Garry
` (5 preceding siblings ...)
2024-08-17 9:47 ` [PATCH v5 6/7] xfs: Validate atomic writes John Garry
@ 2024-08-17 9:48 ` John Garry
2024-08-21 17:11 ` Darrick J. Wong
6 siblings, 1 reply; 28+ messages in thread
From: John Garry @ 2024-08-17 9:48 UTC (permalink / raw)
To: axboe, brauner, djwong, viro, jack, chandan.babu, dchinner, hch
Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch, John Garry
For when an inode is enabled for atomic writes, set FMODE_CAN_ATOMIC_WRITE
flag. Only direct IO is currently supported, so check for that also.
We rely on the block layer to reject atomic writes which exceed the bdev
request_queue limits, so don't bother checking any such thing here.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
fs/xfs/xfs_file.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 9b6530a4eb4a..3489d478809e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1149,6 +1149,18 @@ xfs_file_remap_range(
return remapped > 0 ? remapped : ret;
}
+static bool xfs_file_open_can_atomicwrite(
+ struct inode *inode,
+ struct file *file)
+{
+ struct xfs_inode *ip = XFS_I(inode);
+
+ if (!(file->f_flags & O_DIRECT))
+ return false;
+
+ return xfs_inode_has_atomicwrites(ip);
+}
+
STATIC int
xfs_file_open(
struct inode *inode,
@@ -1157,6 +1169,8 @@ xfs_file_open(
if (xfs_is_shutdown(XFS_M(inode->i_sb)))
return -EIO;
file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
+ if (xfs_file_open_can_atomicwrite(inode, file))
+ file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
return generic_file_open(inode, file);
}
--
2.31.1
^ permalink raw reply related [flat|nested] 28+ messages in thread* Re: [PATCH v5 7/7] xfs: Support setting FMODE_CAN_ATOMIC_WRITE
2024-08-17 9:48 ` [PATCH v5 7/7] xfs: Support setting FMODE_CAN_ATOMIC_WRITE John Garry
@ 2024-08-21 17:11 ` Darrick J. Wong
2024-08-22 18:04 ` John Garry
0 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2024-08-21 17:11 UTC (permalink / raw)
To: John Garry
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
On Sat, Aug 17, 2024 at 09:48:00AM +0000, John Garry wrote:
> For when an inode is enabled for atomic writes, set FMODE_CAN_ATOMIC_WRITE
> flag. Only direct IO is currently supported, so check for that also.
>
> We rely on the block layer to reject atomic writes which exceed the bdev
> request_queue limits, so don't bother checking any such thing here.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> fs/xfs/xfs_file.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 9b6530a4eb4a..3489d478809e 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1149,6 +1149,18 @@ xfs_file_remap_range(
> return remapped > 0 ? remapped : ret;
> }
>
> +static bool xfs_file_open_can_atomicwrite(
> + struct inode *inode,
> + struct file *file)
> +{
> + struct xfs_inode *ip = XFS_I(inode);
> +
> + if (!(file->f_flags & O_DIRECT))
> + return false;
> +
> + return xfs_inode_has_atomicwrites(ip);
...and here too. I do like the shift to having an incore flag that
controls whether you get untorn write support or not.
--D
> +}
> +
> STATIC int
> xfs_file_open(
> struct inode *inode,
> @@ -1157,6 +1169,8 @@ xfs_file_open(
> if (xfs_is_shutdown(XFS_M(inode->i_sb)))
> return -EIO;
> file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
> + if (xfs_file_open_can_atomicwrite(inode, file))
> + file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
> return generic_file_open(inode, file);
> }
>
> --
> 2.31.1
>
>
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH v5 7/7] xfs: Support setting FMODE_CAN_ATOMIC_WRITE
2024-08-21 17:11 ` Darrick J. Wong
@ 2024-08-22 18:04 ` John Garry
2024-08-22 20:44 ` Darrick J. Wong
0 siblings, 1 reply; 28+ messages in thread
From: John Garry @ 2024-08-22 18:04 UTC (permalink / raw)
To: Darrick J. Wong
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
On 21/08/2024 18:11, Darrick J. Wong wrote:
> On Sat, Aug 17, 2024 at 09:48:00AM +0000, John Garry wrote:
>> For when an inode is enabled for atomic writes, set FMODE_CAN_ATOMIC_WRITE
>> flag. Only direct IO is currently supported, so check for that also.
>>
>> We rely on the block layer to reject atomic writes which exceed the bdev
>> request_queue limits, so don't bother checking any such thing here.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>> fs/xfs/xfs_file.c | 14 ++++++++++++++
>> 1 file changed, 14 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index 9b6530a4eb4a..3489d478809e 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -1149,6 +1149,18 @@ xfs_file_remap_range(
>> return remapped > 0 ? remapped : ret;
>> }
>>
>> +static bool xfs_file_open_can_atomicwrite(
>> + struct inode *inode,
>> + struct file *file)
>> +{
>> + struct xfs_inode *ip = XFS_I(inode);
>> +
>> + if (!(file->f_flags & O_DIRECT))
>> + return false;
>> +
>> + return xfs_inode_has_atomicwrites(ip);
>
> ...and here too. I do like the shift to having an incore flag that
> controls whether you get untorn write support or not.
Do you mean to add a new member to xfs_inode to record this? If yes,
it sounds ok, but we need to maintain consistency (of that member)
whenever anything which can affect it changes, which is always a bit
painful.
John
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH v5 7/7] xfs: Support setting FMODE_CAN_ATOMIC_WRITE
2024-08-22 18:04 ` John Garry
@ 2024-08-22 20:44 ` Darrick J. Wong
2024-08-23 10:41 ` John Garry
0 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2024-08-22 20:44 UTC (permalink / raw)
To: John Garry
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
On Thu, Aug 22, 2024 at 07:04:02PM +0100, John Garry wrote:
> On 21/08/2024 18:11, Darrick J. Wong wrote:
> > On Sat, Aug 17, 2024 at 09:48:00AM +0000, John Garry wrote:
> > > For when an inode is enabled for atomic writes, set FMODE_CAN_ATOMIC_WRITE
> > > flag. Only direct IO is currently supported, so check for that also.
> > >
> > > We rely on the block layer to reject atomic writes which exceed the bdev
> > > request_queue limits, so don't bother checking any such thing here.
> > >
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > > fs/xfs/xfs_file.c | 14 ++++++++++++++
> > > 1 file changed, 14 insertions(+)
> > >
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 9b6530a4eb4a..3489d478809e 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -1149,6 +1149,18 @@ xfs_file_remap_range(
> > > return remapped > 0 ? remapped : ret;
> > > }
> > > +static bool xfs_file_open_can_atomicwrite(
> > > + struct inode *inode,
> > > + struct file *file)
> > > +{
> > > + struct xfs_inode *ip = XFS_I(inode);
> > > +
> > > + if (!(file->f_flags & O_DIRECT))
> > > + return false;
> > > +
> > > + return xfs_inode_has_atomicwrites(ip);
> >
> > ...and here too. I do like the shift to having an incore flag that
> > controls whether you get untorn write support or not.
>
> Do you mean to add a new member to xfs_inode to record this? If yes, it
> sounds ok, but we need to maintain consistency (of that member) whenever
> anything which can affect it changes, which is always a bit painful.
I actually meant something more like:
static bool
xfs_file_open_can_atomicwrite(
struct file *file,
struct inode *inode)
{
struct xfs_inode *ip = XFS_I(inode);
struct xfs_mount *mp = ip->i_mount;
struct xfs_buftarg *target = xfs_inode_buftarg(ip);
if (!(file->f_flags & O_DIRECT))
return false;
if (!xfs_inode_has_atomicwrites(ip))
return false;
if (mp->m_dalign && (mp->m_dalign % ip->i_extsize))
return false;
if (mp->m_swidth && (mp->m_swidth % ip->i_extsize))
return false;
if (mp->m_sb.sb_blocksize < target->bt_bdev_awu_min)
return false;
if (xfs_inode_alloc_unitsize(ip) > target->bt_bdev_awu_max)
return false;
return true;
}
--D
> John
>
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH v5 7/7] xfs: Support setting FMODE_CAN_ATOMIC_WRITE
2024-08-22 20:44 ` Darrick J. Wong
@ 2024-08-23 10:41 ` John Garry
2024-08-23 15:52 ` Darrick J. Wong
0 siblings, 1 reply; 28+ messages in thread
From: John Garry @ 2024-08-23 10:41 UTC (permalink / raw)
To: Darrick J. Wong
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
On 22/08/2024 21:44, Darrick J. Wong wrote:
>> Do you mean to add a new member to xfs_inode to record this? If yes, it
>> sounds ok, but we need to maintain consistency (of that member) whenever
>> anything which can affect it changes, which is always a bit painful.
> I actually meant something more like:
>
> static bool
> xfs_file_open_can_atomicwrite(
> struct file *file,
> struct inode *inode)
> {
> struct xfs_inode *ip = XFS_I(inode);
> struct xfs_mount *mp = ip->i_mount;
> struct xfs_buftarg *target = xfs_inode_buftarg(ip);
>
> if (!(file->f_flags & O_DIRECT))
> return false;
> if (!xfs_inode_has_atomicwrites(ip))
> return false;
> if (mp->m_dalign && (mp->m_dalign % ip->i_extsize))
> return false;
> if (mp->m_swidth && (mp->m_swidth % ip->i_extsize))
> return false;
> if (mp->m_sb.sb_blocksize < target->bt_bdev_awu_min)
> return false;
> if (xfs_inode_alloc_unitsize(ip) > target->bt_bdev_awu_max)
> return false;
> return true;
> }
ok, but we should probably factor out some duplicated code with helpers,
like:
bool xfs_validate_atomicwrites_extsize(struct xfs_mount *mp, uint32_t
extsize)
{
if (!is_power_of_2(extsize))
return false;
/* Required to guarantee data block alignment */
if (mp->m_sb.sb_agblocks % extsize)
return false;
/* Requires stripe unit+width be a multiple of extsize */
if (mp->m_dalign && (mp->m_dalign % extsize))
return false;
if (mp->m_swidth && (mp->m_swidth % extsize))
return false;
return true;
}
bool xfs_inode_has_atomicwrites(struct xfs_inode *ip)
{
struct xfs_mount *mp = ip->i_mount;
struct xfs_buftarg *target = xfs_inode_buftarg(ip);
if (!(ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES))
return false;
if (!xfs_validate_atomicwrites_extsize(mp, ip->i_extsize))
return false;
if (mp->m_sb.sb_blocksize < target->bt_bdev_awu_min)
return false;
if (xfs_inode_alloc_unitsize(ip) > target->bt_bdev_awu_max)
return false;
return true;
}
static bool xfs_file_open_can_atomicwrite(
struct inode *inode,
struct file *file)
{
struct xfs_inode *ip = XFS_I(inode);
if (!(file->f_flags & O_DIRECT))
return false;
return xfs_inode_has_atomicwrites(ip);
}
Those helpers can be re-used in xfs_inode_validate_atomicwrites() and
xfs_ioctl_setattr_atomicwrites().
John
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH v5 7/7] xfs: Support setting FMODE_CAN_ATOMIC_WRITE
2024-08-23 10:41 ` John Garry
@ 2024-08-23 15:52 ` Darrick J. Wong
0 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2024-08-23 15:52 UTC (permalink / raw)
To: John Garry
Cc: axboe, brauner, viro, jack, chandan.babu, dchinner, hch,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, hare,
martin.petersen, catherine.hoang, kbusch
On Fri, Aug 23, 2024 at 11:41:07AM +0100, John Garry wrote:
> On 22/08/2024 21:44, Darrick J. Wong wrote:
> > > Do you mean that add a new member to xfs_inode to record this? If yes, it
> > > sounds ok, but we need to maintain consistency (of that member) whenever
> > > anything which can affect it changes, which is always a bit painful.
> > I actually meant something more like:
> >
> > static bool
> > xfs_file_open_can_atomicwrite(
> > struct file *file,
> > struct inode *inode)
> > {
> > struct xfs_inode *ip = XFS_I(inode);
> > struct xfs_mount *mp = ip->i_mount;
> > struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> >
> > if (!(file->f_flags & O_DIRECT))
> > return false;
> > if (!xfs_inode_has_atomicwrites(ip))
> > return false;
> > if (mp->m_dalign && (mp->m_dalign % ip->i_extsize))
> > return false;
> > if (mp->m_swidth && (mp->m_swidth % ip->i_extsize))
> > return false;
> > if (mp->m_sb.sb_blocksize < target->bt_bdev_awu_min)
> > return false;
> > if (xfs_inode_alloc_unitsize(ip) > target->bt_bdev_awu_max)
> > return false;
> > return true;
> > }
>
> ok, but we should probably factor out some duplicated code with helpers,
> like:
>
> bool xfs_validate_atomicwrites_extsize(struct xfs_mount *mp, uint32_t
> extsize)
xfs_agblock_t extsize, but other than that this looks right to me.
> {
> if (!is_power_of_2(extsize))
> return false;
>
> /* Required to guarantee data block alignment */
> if (mp->m_sb.sb_agblocks % extsize)
> return false;
>
> /* Requires stripe unit+width be a multiple of extsize */
> if (mp->m_dalign && (mp->m_dalign % extsize))
> return false;
>
> if (mp->m_swidth && (mp->m_swidth % extsize))
> return false;
>
> return true;
> }
>
>
> bool xfs_inode_has_atomicwrites(struct xfs_inode *ip)
> {
> struct xfs_mount *mp = ip->i_mount;
> struct xfs_buftarg *target = xfs_inode_buftarg(ip);
>
> if (!(ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES))
> return false;
> if (!xfs_validate_atomicwrites_extsize(mp, ip->i_extsize))
> return false;
> if (mp->m_sb.sb_blocksize < target->bt_bdev_awu_min)
> return false;
> if (xfs_inode_alloc_unitsize(ip) > target->bt_bdev_awu_max)
> return false;
> return true;
> }
>
>
> static bool xfs_file_open_can_atomicwrite(
> struct inode *inode,
> struct file *file)
> {
> struct xfs_inode *ip = XFS_I(inode);
>
> if (!(file->f_flags & O_DIRECT))
> return false;
> return xfs_inode_has_atomicwrites(ip);
> }
>
> Those helpers can be re-used in xfs_inode_validate_atomicwrites() and
> xfs_ioctl_setattr_atomicwrites().
Looks good to me.
--D
>
> John
>
>
^ permalink raw reply [flat|nested] 28+ messages in thread