public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* RFC: iomap patches for zoned XFS
@ 2024-12-11  8:53 Christoph Hellwig
  2024-12-11  8:53 ` [PATCH 1/8] iomap: allow the file system to submit the writeback bios Christoph Hellwig
                   ` (8 more replies)
  0 siblings, 9 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-12-11  8:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Darrick J. Wong, Carlos Maiolino, linux-xfs, linux-fsdevel

Hi all,

this series contains the iomap prep work to support zoned XFS.

The biggest changes are:

 - an option to reuse the ioend code for direct writes in addition to the
   current use for buffered writeback, which allows the file system to
   track completions on a per-bio basis instead of the current end_io
   callback which operates on the entire I/O.
   Note that it might make sense to split the ioend code from
   buffered-io.c into its own file with this.  Let me know what you think
   of that and I can include it in the next version
 - change of the writeback_ops so that the submit_bio call can be done by
   the file system.  Note that btrfs will also need this eventually when
   it starts using iomap
 - helpers to split ioend to the zone append queue_limits that plug
   into the previous item above.
 - a bunch of changes for slightly different merge conditions when using
   zone append.  Note that btrfs wants something similar also for
   compressed I/O, which might be able to share some code.  For now
   the flags use zone append naming, but we can change that if it gets
   used elsewhere.
 - passing private data to a few more helper

The XFS changes to use this will be posted to the xfs list only to not
spam fsdevel too much.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/8] iomap: allow the file system to submit the writeback bios
  2024-12-11  8:53 RFC: iomap patches for zoned XFS Christoph Hellwig
@ 2024-12-11  8:53 ` Christoph Hellwig
  2024-12-12 17:48   ` Darrick J. Wong
  2024-12-11  8:53 ` [PATCH 2/8] iomap: simplify io_flags and io_type in struct iomap_ioend Christoph Hellwig
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2024-12-11  8:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Darrick J. Wong, Carlos Maiolino, linux-xfs, linux-fsdevel

Change ->prepare_ioend to ->submit_ioend and require file systems that
implement it to submit the bio.  This is needed for file systems that
do their own work on the bios before submitting them to the block layer
like btrfs or zoned xfs.  To make this easier also pass the writeback
context to the method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 10 +++++-----
 fs/xfs/xfs_aops.c      | 13 +++++++++----
 include/linux/iomap.h  | 12 +++++++-----
 3 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 955f19e27e47..cdccf11bb3be 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1675,7 +1675,7 @@ static void iomap_writepage_end_bio(struct bio *bio)
 }
 
 /*
- * Submit the final bio for an ioend.
+ * Submit an ioend.
  *
  * If @error is non-zero, it means that we have a situation where some part of
  * the submission process has failed after we've marked pages for writeback.
@@ -1694,14 +1694,14 @@ static int iomap_submit_ioend(struct iomap_writepage_ctx *wpc, int error)
 	 * failure happened so that the file system end I/O handler gets called
 	 * to clean up.
 	 */
-	if (wpc->ops->prepare_ioend)
-		error = wpc->ops->prepare_ioend(wpc->ioend, error);
+	if (wpc->ops->submit_ioend)
+		error = wpc->ops->submit_ioend(wpc, error);
+	else if (!error)
+		submit_bio(&wpc->ioend->io_bio);
 
 	if (error) {
 		wpc->ioend->io_bio.bi_status = errno_to_blk_status(error);
 		bio_endio(&wpc->ioend->io_bio);
-	} else {
-		submit_bio(&wpc->ioend->io_bio);
 	}
 
 	wpc->ioend = NULL;
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 559a3a577097..d175853da5ae 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -395,10 +395,11 @@ xfs_map_blocks(
 }
 
 static int
-xfs_prepare_ioend(
-	struct iomap_ioend	*ioend,
+xfs_submit_ioend(
+	struct iomap_writepage_ctx *wpc,
 	int			status)
 {
+	struct iomap_ioend	*ioend = wpc->ioend;
 	unsigned int		nofs_flag;
 
 	/*
@@ -420,7 +421,11 @@ xfs_prepare_ioend(
 	if (xfs_ioend_is_append(ioend) || ioend->io_type == IOMAP_UNWRITTEN ||
 	    (ioend->io_flags & IOMAP_F_SHARED))
 		ioend->io_bio.bi_end_io = xfs_end_bio;
-	return status;
+
+	if (status)
+		return status;
+	submit_bio(&ioend->io_bio);
+	return 0;
 }
 
 /*
@@ -462,7 +467,7 @@ xfs_discard_folio(
 
 static const struct iomap_writeback_ops xfs_writeback_ops = {
 	.map_blocks		= xfs_map_blocks,
-	.prepare_ioend		= xfs_prepare_ioend,
+	.submit_ioend		= xfs_submit_ioend,
 	.discard_folio		= xfs_discard_folio,
 };
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 5675af6b740c..c0339678d798 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -362,12 +362,14 @@ struct iomap_writeback_ops {
 			  loff_t offset, unsigned len);
 
 	/*
-	 * Optional, allows the file systems to perform actions just before
-	 * submitting the bio and/or override the bio end_io handler for complex
-	 * operations like copy on write extent manipulation or unwritten extent
-	 * conversions.
+	 * Optional, allows the file systems to hook into bio submission,
+	 * including overriding the bi_end_io handler.
+	 *
+	 * Returns 0 if the bio was successfully submitted, or a negative
+	 * error code if status was non-zero or another error happened and
+	 * the bio could not be submitted.
 	 */
-	int (*prepare_ioend)(struct iomap_ioend *ioend, int status);
+	int (*submit_ioend)(struct iomap_writepage_ctx *wpc, int status);
 
 	/*
 	 * Optional, allows the file system to discard state on a page where
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/8] iomap: simplify io_flags and io_type in struct iomap_ioend
  2024-12-11  8:53 RFC: iomap patches for zoned XFS Christoph Hellwig
  2024-12-11  8:53 ` [PATCH 1/8] iomap: allow the file system to submit the writeback bios Christoph Hellwig
@ 2024-12-11  8:53 ` Christoph Hellwig
  2024-12-12 17:55   ` Darrick J. Wong
  2024-12-11  8:53 ` [PATCH 3/8] iomap: add a IOMAP_F_ZONE_APPEND flag Christoph Hellwig
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2024-12-11  8:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Darrick J. Wong, Carlos Maiolino, linux-xfs, linux-fsdevel

The ioend fields for distinct types of I/O are a bit complicated.
Consolidate them into a single io_flag field with it's own flags
decoupled from the iomap flags.  This also prepares for adding a new
flag that is unrelated to both of the iomap namespaces.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 39 ++++++++++++++++++++++-----------------
 fs/xfs/xfs_aops.c      | 12 ++++++------
 include/linux/iomap.h  | 16 ++++++++++++++--
 3 files changed, 42 insertions(+), 25 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index cdccf11bb3be..3176dc996fb7 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1605,13 +1605,10 @@ iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next)
 {
 	if (ioend->io_bio.bi_status != next->io_bio.bi_status)
 		return false;
-	if (next->io_flags & IOMAP_F_BOUNDARY)
+	if (next->io_flags & IOMAP_IOEND_BOUNDARY)
 		return false;
-	if ((ioend->io_flags & IOMAP_F_SHARED) ^
-	    (next->io_flags & IOMAP_F_SHARED))
-		return false;
-	if ((ioend->io_type == IOMAP_UNWRITTEN) ^
-	    (next->io_type == IOMAP_UNWRITTEN))
+	if ((ioend->io_flags & IOMAP_IOEND_NOMERGE_FLAGS) !=
+	    (next->io_flags & IOMAP_IOEND_NOMERGE_FLAGS))
 		return false;
 	if (ioend->io_offset + ioend->io_size != next->io_offset)
 		return false;
@@ -1709,7 +1706,8 @@ static int iomap_submit_ioend(struct iomap_writepage_ctx *wpc, int error)
 }
 
 static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
-		struct writeback_control *wbc, struct inode *inode, loff_t pos)
+		struct writeback_control *wbc, struct inode *inode, loff_t pos,
+		u16 ioend_flags)
 {
 	struct iomap_ioend *ioend;
 	struct bio *bio;
@@ -1724,8 +1722,7 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
 
 	ioend = iomap_ioend_from_bio(bio);
 	INIT_LIST_HEAD(&ioend->io_list);
-	ioend->io_type = wpc->iomap.type;
-	ioend->io_flags = wpc->iomap.flags;
+	ioend->io_flags = ioend_flags;
 	if (pos > wpc->iomap.offset)
 		wpc->iomap.flags &= ~IOMAP_F_BOUNDARY;
 	ioend->io_inode = inode;
@@ -1737,14 +1734,13 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
 	return ioend;
 }
 
-static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos)
+static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos,
+		u16 ioend_flags)
 {
-	if (wpc->iomap.offset == pos && (wpc->iomap.flags & IOMAP_F_BOUNDARY))
-		return false;
-	if ((wpc->iomap.flags & IOMAP_F_SHARED) !=
-	    (wpc->ioend->io_flags & IOMAP_F_SHARED))
+	if (ioend_flags & IOMAP_IOEND_BOUNDARY)
 		return false;
-	if (wpc->iomap.type != wpc->ioend->io_type)
+	if ((ioend_flags & IOMAP_IOEND_NOMERGE_FLAGS) !=
+	    (wpc->ioend->io_flags & IOMAP_IOEND_NOMERGE_FLAGS))
 		return false;
 	if (pos != wpc->ioend->io_offset + wpc->ioend->io_size)
 		return false;
@@ -1778,14 +1774,23 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
 {
 	struct iomap_folio_state *ifs = folio->private;
 	size_t poff = offset_in_folio(folio, pos);
+	unsigned int ioend_flags = 0;
 	int error;
 
-	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) {
+	if (wpc->iomap.type == IOMAP_UNWRITTEN)
+		ioend_flags |= IOMAP_IOEND_UNWRITTEN;
+	if (wpc->iomap.flags & IOMAP_F_SHARED)
+		ioend_flags |= IOMAP_IOEND_SHARED;
+	if (pos == wpc->iomap.offset && (wpc->iomap.flags & IOMAP_F_BOUNDARY))
+		ioend_flags |= IOMAP_IOEND_BOUNDARY;
+
+	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos, ioend_flags)) {
 new_ioend:
 		error = iomap_submit_ioend(wpc, 0);
 		if (error)
 			return error;
-		wpc->ioend = iomap_alloc_ioend(wpc, wbc, inode, pos);
+		wpc->ioend = iomap_alloc_ioend(wpc, wbc, inode, pos,
+				ioend_flags);
 	}
 
 	if (!bio_add_folio(&wpc->ioend->io_bio, folio, len, poff))
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d175853da5ae..d35ac4c19fb2 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -114,7 +114,7 @@ xfs_end_ioend(
 	 */
 	error = blk_status_to_errno(ioend->io_bio.bi_status);
 	if (unlikely(error)) {
-		if (ioend->io_flags & IOMAP_F_SHARED) {
+		if (ioend->io_flags & IOMAP_IOEND_SHARED) {
 			xfs_reflink_cancel_cow_range(ip, offset, size, true);
 			xfs_bmap_punch_delalloc_range(ip, XFS_DATA_FORK, offset,
 					offset + size);
@@ -125,9 +125,9 @@ xfs_end_ioend(
 	/*
 	 * Success: commit the COW or unwritten blocks if needed.
 	 */
-	if (ioend->io_flags & IOMAP_F_SHARED)
+	if (ioend->io_flags & IOMAP_IOEND_SHARED)
 		error = xfs_reflink_end_cow(ip, offset, size);
-	else if (ioend->io_type == IOMAP_UNWRITTEN)
+	else if (ioend->io_flags & IOMAP_IOEND_UNWRITTEN)
 		error = xfs_iomap_write_unwritten(ip, offset, size, false);
 
 	if (!error && xfs_ioend_is_append(ioend))
@@ -410,7 +410,7 @@ xfs_submit_ioend(
 	nofs_flag = memalloc_nofs_save();
 
 	/* Convert CoW extents to regular */
-	if (!status && (ioend->io_flags & IOMAP_F_SHARED)) {
+	if (!status && (ioend->io_flags & IOMAP_IOEND_SHARED)) {
 		status = xfs_reflink_convert_cow(XFS_I(ioend->io_inode),
 				ioend->io_offset, ioend->io_size);
 	}
@@ -418,8 +418,8 @@ xfs_submit_ioend(
 	memalloc_nofs_restore(nofs_flag);
 
 	/* send ioends that might require a transaction to the completion wq */
-	if (xfs_ioend_is_append(ioend) || ioend->io_type == IOMAP_UNWRITTEN ||
-	    (ioend->io_flags & IOMAP_F_SHARED))
+	if (xfs_ioend_is_append(ioend) ||
+	    (ioend->io_flags & (IOMAP_IOEND_UNWRITTEN | IOMAP_IOEND_SHARED)))
 		ioend->io_bio.bi_end_io = xfs_end_bio;
 
 	if (status)
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index c0339678d798..1d8658c7beb8 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -327,13 +327,25 @@ loff_t iomap_seek_data(struct inode *inode, loff_t offset,
 sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
 		const struct iomap_ops *ops);
 
+/*
+ * Flags for iomap_ioend->io_flags.
+ */
+/* shared COW extent */
+#define IOMAP_IOEND_SHARED		(1U << 0)
+/* unwritten extent */
+#define IOMAP_IOEND_UNWRITTEN		(1U << 1)
+/* don't merge into previous ioend */
+#define IOMAP_IOEND_BOUNDARY		(1U << 2)
+
+#define IOMAP_IOEND_NOMERGE_FLAGS \
+	(IOMAP_IOEND_SHARED | IOMAP_IOEND_UNWRITTEN)
+
 /*
  * Structure for writeback I/O completions.
  */
 struct iomap_ioend {
 	struct list_head	io_list;	/* next ioend in chain */
-	u16			io_type;
-	u16			io_flags;	/* IOMAP_F_* */
+	u16			io_flags;	/* IOMAP_IOEND_* */
 	struct inode		*io_inode;	/* file being written to */
 	size_t			io_size;	/* size of the extent */
 	loff_t			io_offset;	/* offset in the file */
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 3/8] iomap: add a IOMAP_F_ZONE_APPEND flag
  2024-12-11  8:53 RFC: iomap patches for zoned XFS Christoph Hellwig
  2024-12-11  8:53 ` [PATCH 1/8] iomap: allow the file system to submit the writeback bios Christoph Hellwig
  2024-12-11  8:53 ` [PATCH 2/8] iomap: simplify io_flags and io_type in struct iomap_ioend Christoph Hellwig
@ 2024-12-11  8:53 ` Christoph Hellwig
  2024-12-12 18:05   ` Darrick J. Wong
  2024-12-11  8:53 ` [PATCH 4/8] iomap: split bios to zone append limits in the submission handlers Christoph Hellwig
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2024-12-11  8:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Darrick J. Wong, Carlos Maiolino, linux-xfs, linux-fsdevel

This doesn't much - just always returns the start block number for each
iomap instead of increasing it.  This is because we'll keep building bios
unconstrained by the hardware limits and just split them in file system
submission handler.

Maybe we should find another name for it, because it might be useful for
btrfs compressed bio submissions as well, but I can't come up with a
good one.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 19 ++++++++++++++++---
 include/linux/iomap.h  |  7 +++++++
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 3176dc996fb7..129cd96c6c96 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1744,9 +1744,22 @@ static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos,
 		return false;
 	if (pos != wpc->ioend->io_offset + wpc->ioend->io_size)
 		return false;
-	if (iomap_sector(&wpc->iomap, pos) !=
-	    bio_end_sector(&wpc->ioend->io_bio))
-		return false;
+	if (wpc->iomap.flags & IOMAP_F_ZONE_APPEND) {
+		/*
+		 * For Zone Append command, bi_sector points to the zone start
+		 * before submission.  We can merge all I/O for the same zone.
+		 */
+		if (iomap_sector(&wpc->iomap, pos) !=
+		    wpc->ioend->io_bio.bi_iter.bi_sector)
+			return false;
+	} else {
+		/*
+		 * For regular writes, the disk blocks needs to be contiguous.
+		 */
+		if (iomap_sector(&wpc->iomap, pos) !=
+		    bio_end_sector(&wpc->ioend->io_bio))
+			return false;
+	}
 	/*
 	 * Limit ioend bio chain lengths to minimise IO completion latency. This
 	 * also prevents long tight loops ending page writeback on all the
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 1d8658c7beb8..173d490c20ba 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -56,6 +56,10 @@ struct vm_fault;
  *
  * IOMAP_F_BOUNDARY indicates that I/O and I/O completions for this iomap must
  * never be merged with the mapping before it.
+ *
+ * IOMAP_F_ZONE_APPEND indicates that (write) I/O should be done as a zone
+ * append command for zoned devices.  Note that the file system needs to
+ * override the bi_end_io handler to record the actual written sector.
  */
 #define IOMAP_F_NEW		(1U << 0)
 #define IOMAP_F_DIRTY		(1U << 1)
@@ -68,6 +72,7 @@ struct vm_fault;
 #endif /* CONFIG_BUFFER_HEAD */
 #define IOMAP_F_XATTR		(1U << 5)
 #define IOMAP_F_BOUNDARY	(1U << 6)
+#define IOMAP_F_ZONE_APPEND	(1U << 7)
 
 /*
  * Flags set by the core iomap code during operations:
@@ -111,6 +116,8 @@ struct iomap {
 
 static inline sector_t iomap_sector(const struct iomap *iomap, loff_t pos)
 {
+	if (iomap->flags & IOMAP_F_ZONE_APPEND)
+		return iomap->addr >> SECTOR_SHIFT;
 	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
 }
 
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 4/8] iomap: split bios to zone append limits in the submission handlers
  2024-12-11  8:53 RFC: iomap patches for zoned XFS Christoph Hellwig
                   ` (2 preceding siblings ...)
  2024-12-11  8:53 ` [PATCH 3/8] iomap: add a IOMAP_F_ZONE_APPEND flag Christoph Hellwig
@ 2024-12-11  8:53 ` Christoph Hellwig
  2024-12-12 13:28   ` Brian Foster
                     ` (2 more replies)
  2024-12-11  8:53 ` [PATCH 5/8] iomap: optionally use ioends for direct I/O Christoph Hellwig
                   ` (4 subsequent siblings)
  8 siblings, 3 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-12-11  8:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Darrick J. Wong, Carlos Maiolino, linux-xfs, linux-fsdevel

Provide helpers for file systems to split bios in the direct I/O and
writeback I/O submission handlers.

This Follows btrfs' lead and don't try to build bios to hardware limits
for zone append commands, but instead build them as normal unconstrained
bios and split them to the hardware limits in the I/O submission handler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/Makefile      |  1 +
 fs/iomap/buffered-io.c | 43 ++++++++++++++-----------
 fs/iomap/ioend.c       | 73 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/iomap.h  |  9 ++++++
 4 files changed, 108 insertions(+), 18 deletions(-)
 create mode 100644 fs/iomap/ioend.c

diff --git a/fs/iomap/Makefile b/fs/iomap/Makefile
index 381d76c5c232..69e8ebb41302 100644
--- a/fs/iomap/Makefile
+++ b/fs/iomap/Makefile
@@ -12,6 +12,7 @@ iomap-y				+= trace.o \
 				   iter.o
 iomap-$(CONFIG_BLOCK)		+= buffered-io.o \
 				   direct-io.o \
+				   ioend.o \
 				   fiemap.o \
 				   seek.o
 iomap-$(CONFIG_SWAP)		+= swapfile.o
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 129cd96c6c96..8125f758a99d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -40,7 +40,8 @@ struct iomap_folio_state {
 	unsigned long		state[];
 };
 
-static struct bio_set iomap_ioend_bioset;
+struct bio_set iomap_ioend_bioset;
+EXPORT_SYMBOL_GPL(iomap_ioend_bioset);
 
 static inline bool ifs_is_fully_uptodate(struct folio *folio,
 		struct iomap_folio_state *ifs)
@@ -1539,15 +1540,15 @@ static void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
  * ioend after this.
  */
 static u32
-iomap_finish_ioend(struct iomap_ioend *ioend, int error)
+iomap_finish_ioend_buffered(struct iomap_ioend *ioend)
 {
 	struct inode *inode = ioend->io_inode;
 	struct bio *bio = &ioend->io_bio;
 	struct folio_iter fi;
 	u32 folio_count = 0;
 
-	if (error) {
-		mapping_set_error(inode->i_mapping, error);
+	if (ioend->io_error) {
+		mapping_set_error(inode->i_mapping, ioend->io_error);
 		if (!bio_flagged(bio, BIO_QUIET)) {
 			pr_err_ratelimited(
 "%s: writeback error on inode %lu, offset %lld, sector %llu",
@@ -1566,6 +1567,24 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 	return folio_count;
 }
 
+static u32
+iomap_finish_ioend(struct iomap_ioend *ioend, int error)
+{
+	if (ioend->io_parent) {
+		struct bio *bio = &ioend->io_bio;
+
+		ioend = ioend->io_parent;
+		bio_put(bio);
+	}
+
+	if (error)
+		cmpxchg(&ioend->io_error, 0, error);
+
+	if (!atomic_dec_and_test(&ioend->io_remaining))
+		return 0;
+	return iomap_finish_ioend_buffered(ioend);
+}
+
 /*
  * Ioend completion routine for merged bios. This can only be called from task
  * contexts as merged ioends can be of unbound length. Hence we have to break up
@@ -1709,7 +1728,6 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct inode *inode, loff_t pos,
 		u16 ioend_flags)
 {
-	struct iomap_ioend *ioend;
 	struct bio *bio;
 
 	bio = bio_alloc_bioset(wpc->iomap.bdev, BIO_MAX_VECS,
@@ -1717,21 +1735,10 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
 			       GFP_NOFS, &iomap_ioend_bioset);
 	bio->bi_iter.bi_sector = iomap_sector(&wpc->iomap, pos);
 	bio->bi_end_io = iomap_writepage_end_bio;
-	wbc_init_bio(wbc, bio);
 	bio->bi_write_hint = inode->i_write_hint;
-
-	ioend = iomap_ioend_from_bio(bio);
-	INIT_LIST_HEAD(&ioend->io_list);
-	ioend->io_flags = ioend_flags;
-	if (pos > wpc->iomap.offset)
-		wpc->iomap.flags &= ~IOMAP_F_BOUNDARY;
-	ioend->io_inode = inode;
-	ioend->io_size = 0;
-	ioend->io_offset = pos;
-	ioend->io_sector = bio->bi_iter.bi_sector;
-
+	wbc_init_bio(wbc, bio);
 	wpc->nr_folios = 0;
-	return ioend;
+	return iomap_init_ioend(inode, bio, pos, ioend_flags);
 }
 
 static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos,
diff --git a/fs/iomap/ioend.c b/fs/iomap/ioend.c
new file mode 100644
index 000000000000..f3d98121c593
--- /dev/null
+++ b/fs/iomap/ioend.c
@@ -0,0 +1,73 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2024 Christoph Hellwig.
+ */
+#include <linux/iomap.h>
+
+struct iomap_ioend *iomap_init_ioend(struct inode *inode,
+		struct bio *bio, loff_t file_offset, u16 flags)
+{
+	struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
+
+	atomic_set(&ioend->io_remaining, 1);
+	ioend->io_error = 0;
+	ioend->io_parent = NULL;
+	INIT_LIST_HEAD(&ioend->io_list);
+	ioend->io_flags = flags;
+	ioend->io_inode = inode;
+	ioend->io_offset = file_offset;
+	ioend->io_size = bio->bi_iter.bi_size;
+	ioend->io_sector = bio->bi_iter.bi_sector;
+	return ioend;
+}
+EXPORT_SYMBOL_GPL(iomap_init_ioend);
+
+struct iomap_ioend *iomap_split_ioend(struct iomap_ioend *ioend, bool is_append,
+		unsigned int *alloc_len)
+{
+	struct bio *bio = &ioend->io_bio;
+	struct iomap_ioend *split_ioend;
+	struct bio *split;
+	int sector_offset;
+	unsigned int nr_segs;
+
+	if (is_append) {
+		struct queue_limits *lim = bdev_limits(bio->bi_bdev);
+
+		sector_offset = bio_split_rw_at(bio, lim, &nr_segs,
+			min(lim->max_zone_append_sectors << SECTOR_SHIFT,
+			    *alloc_len));
+		if (!sector_offset)
+			return NULL;
+	} else {
+		if (bio->bi_iter.bi_size <= *alloc_len)
+			return NULL;
+		sector_offset = *alloc_len >> SECTOR_SHIFT;
+	}
+
+	/* ensure the split ioend is still block size aligned */
+	sector_offset = ALIGN_DOWN(sector_offset << SECTOR_SHIFT,
+			i_blocksize(ioend->io_inode)) >> SECTOR_SHIFT;
+
+	split = bio_split(bio, sector_offset, GFP_NOFS, &iomap_ioend_bioset);
+	if (!split)
+		return NULL;
+	split->bi_private = bio->bi_private;
+	split->bi_end_io = bio->bi_end_io;
+
+	split_ioend = iomap_init_ioend(ioend->io_inode, split, ioend->io_offset,
+			ioend->io_flags);
+	split_ioend->io_parent = ioend;
+
+	atomic_inc(&ioend->io_remaining);
+	ioend->io_offset += split_ioend->io_size;
+	ioend->io_size -= split_ioend->io_size;
+
+	split_ioend->io_sector = ioend->io_sector;
+	if (!is_append)
+		ioend->io_sector += (split_ioend->io_size >> SECTOR_SHIFT);
+
+	*alloc_len -= split->bi_iter.bi_size;
+	return split_ioend;
+}
+EXPORT_SYMBOL_GPL(iomap_split_ioend);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 173d490c20ba..eaa8cb9083eb 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -354,6 +354,9 @@ struct iomap_ioend {
 	struct list_head	io_list;	/* next ioend in chain */
 	u16			io_flags;	/* IOMAP_IOEND_* */
 	struct inode		*io_inode;	/* file being written to */
+	atomic_t		io_remaining;	/* completetion defer count */
+	int			io_error;	/* stashed away status */
+	struct iomap_ioend	*io_parent;	/* parent for completions */
 	size_t			io_size;	/* size of the extent */
 	loff_t			io_offset;	/* offset in the file */
 	sector_t		io_sector;	/* start sector of ioend */
@@ -404,6 +407,10 @@ struct iomap_writepage_ctx {
 	u32			nr_folios;	/* folios added to the ioend */
 };
 
+struct iomap_ioend *iomap_init_ioend(struct inode *inode, struct bio *bio,
+		loff_t file_offset, u16 flags);
+struct iomap_ioend *iomap_split_ioend(struct iomap_ioend *ioend, bool is_append,
+		unsigned int *alloc_len);
 void iomap_finish_ioends(struct iomap_ioend *ioend, int error);
 void iomap_ioend_try_merge(struct iomap_ioend *ioend,
 		struct list_head *more_ioends);
@@ -475,4 +482,6 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
 # define iomap_swapfile_activate(sis, swapfile, pagespan, ops)	(-EIO)
 #endif /* CONFIG_SWAP */
 
+extern struct bio_set iomap_ioend_bioset;
+
 #endif /* LINUX_IOMAP_H */
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 5/8] iomap: optionally use ioends for direct I/O
  2024-12-11  8:53 RFC: iomap patches for zoned XFS Christoph Hellwig
                   ` (3 preceding siblings ...)
  2024-12-11  8:53 ` [PATCH 4/8] iomap: split bios to zone append limits in the submission handlers Christoph Hellwig
@ 2024-12-11  8:53 ` Christoph Hellwig
  2024-12-12 13:29   ` Brian Foster
  2024-12-12 19:56   ` Darrick J. Wong
  2024-12-11  8:53 ` [PATCH 6/8] iomap: pass private data to iomap_page_mkwrite Christoph Hellwig
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-12-11  8:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Darrick J. Wong, Carlos Maiolino, linux-xfs, linux-fsdevel

struct iomap_ioend currently tracks outstanding buffered writes and has
some really nice code in core iomap and XFS to merge contiguous I/Os
an defer them to userspace for completion in a very efficient way.

For zoned writes we'll also need a per-bio user context completion to
record the written blocks, and the infrastructure for that would look
basically like the ioend handling for buffered I/O.

So instead of reinventing the wheel, reuse the existing infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c |  3 +++
 fs/iomap/direct-io.c   | 50 +++++++++++++++++++++++++++++++++++++++++-
 fs/iomap/internal.h    |  7 ++++++
 include/linux/iomap.h  |  4 +++-
 4 files changed, 62 insertions(+), 2 deletions(-)
 create mode 100644 fs/iomap/internal.h

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8125f758a99d..ceca9473a09c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/sched/signal.h>
 #include <linux/migrate.h>
+#include "internal.h"
 #include "trace.h"
 
 #include "../internal.h"
@@ -1582,6 +1583,8 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 
 	if (!atomic_dec_and_test(&ioend->io_remaining))
 		return 0;
+	if (ioend->io_flags & IOMAP_IOEND_DIRECT)
+		return iomap_finish_ioend_direct(ioend);
 	return iomap_finish_ioend_buffered(ioend);
 }
 
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b521eb15759e..b5466361cafe 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -12,6 +12,7 @@
 #include <linux/backing-dev.h>
 #include <linux/uio.h>
 #include <linux/task_io_accounting_ops.h>
+#include "internal.h"
 #include "trace.h"
 
 #include "../internal.h"
@@ -20,6 +21,7 @@
  * Private flags for iomap_dio, must not overlap with the public ones in
  * iomap.h:
  */
+#define IOMAP_DIO_NO_INVALIDATE	(1U << 25)
 #define IOMAP_DIO_CALLER_COMP	(1U << 26)
 #define IOMAP_DIO_INLINE_COMP	(1U << 27)
 #define IOMAP_DIO_WRITE_THROUGH	(1U << 28)
@@ -117,7 +119,8 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 	 * ->end_io() when necessary, otherwise a racing buffer read would cache
 	 * zeros from unwritten extents.
 	 */
-	if (!dio->error && dio->size && (dio->flags & IOMAP_DIO_WRITE))
+	if (!dio->error && dio->size && (dio->flags & IOMAP_DIO_WRITE) &&
+	    !(dio->flags & IOMAP_DIO_NO_INVALIDATE))
 		kiocb_invalidate_post_direct_write(iocb, dio->size);
 
 	inode_dio_end(file_inode(iocb->ki_filp));
@@ -163,6 +166,51 @@ static inline void iomap_dio_set_error(struct iomap_dio *dio, int ret)
 	cmpxchg(&dio->error, 0, ret);
 }
 
+u32 iomap_finish_ioend_direct(struct iomap_ioend *ioend)
+{
+	struct iomap_dio *dio = ioend->io_bio.bi_private;
+	bool should_dirty = (dio->flags & IOMAP_DIO_DIRTY);
+	struct kiocb *iocb = dio->iocb;
+	u32 vec_count = ioend->io_bio.bi_vcnt;
+
+	if (ioend->io_error)
+		iomap_dio_set_error(dio, ioend->io_error);
+
+	if (atomic_dec_and_test(&dio->ref)) {
+		struct inode *inode = file_inode(iocb->ki_filp);
+
+		if (dio->wait_for_completion) {
+			struct task_struct *waiter = dio->submit.waiter;
+
+			WRITE_ONCE(dio->submit.waiter, NULL);
+			blk_wake_io_task(waiter);
+		} else if (!inode->i_mapping->nrpages) {
+			WRITE_ONCE(iocb->private, NULL);
+
+			/*
+			 * We must never invalidate pages from this thread to
+			 * avoid deadlocks with buffered I/O completions.
+			 * Tough luck if you hit the tiny race with someone
+			 * dirtying the range now.
+			 */
+			dio->flags |= IOMAP_DIO_NO_INVALIDATE;
+			iomap_dio_complete_work(&dio->aio.work);
+		} else {
+			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
+			queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
+		}
+	}
+
+	if (should_dirty) {
+		bio_check_pages_dirty(&ioend->io_bio);
+	} else {
+		bio_release_pages(&ioend->io_bio, false);
+		bio_put(&ioend->io_bio);
+	}
+
+	return vec_count;
+}
+
 void iomap_dio_bio_end_io(struct bio *bio)
 {
 	struct iomap_dio *dio = bio->bi_private;
diff --git a/fs/iomap/internal.h b/fs/iomap/internal.h
new file mode 100644
index 000000000000..20cccfc3bb13
--- /dev/null
+++ b/fs/iomap/internal.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _IOMAP_INTERNAL_H
+#define _IOMAP_INTERNAL_H 1
+
+u32 iomap_finish_ioend_direct(struct iomap_ioend *ioend);
+
+#endif /* _IOMAP_INTERNAL_H */
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index eaa8cb9083eb..f6943c80e5fd 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -343,9 +343,11 @@ sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
 #define IOMAP_IOEND_UNWRITTEN		(1U << 1)
 /* don't merge into previous ioend */
 #define IOMAP_IOEND_BOUNDARY		(1U << 2)
+/* is direct I/O */
+#define IOMAP_IOEND_DIRECT		(1U << 3)
 
 #define IOMAP_IOEND_NOMERGE_FLAGS \
-	(IOMAP_IOEND_SHARED | IOMAP_IOEND_UNWRITTEN)
+	(IOMAP_IOEND_SHARED | IOMAP_IOEND_UNWRITTEN | IOMAP_IOEND_DIRECT)
 
 /*
  * Structure for writeback I/O completions.
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 6/8] iomap: pass private data to iomap_page_mkwrite
  2024-12-11  8:53 RFC: iomap patches for zoned XFS Christoph Hellwig
                   ` (4 preceding siblings ...)
  2024-12-11  8:53 ` [PATCH 5/8] iomap: optionally use ioends for direct I/O Christoph Hellwig
@ 2024-12-11  8:53 ` Christoph Hellwig
  2024-12-11  8:53 ` [PATCH 7/8] iomap: pass private data to iomap_zero_range Christoph Hellwig
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-12-11  8:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Darrick J. Wong, Carlos Maiolino, linux-xfs, linux-fsdevel

Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 4 +++-
 fs/xfs/xfs_file.c      | 3 ++-
 fs/zonefs/file.c       | 2 +-
 include/linux/iomap.h  | 5 ++---
 4 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ceca9473a09c..0863c78dc69f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1495,11 +1495,13 @@ static loff_t iomap_folio_mkwrite_iter(struct iomap_iter *iter,
 	return length;
 }
 
-vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
+vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops,
+		void *private)
 {
 	struct iomap_iter iter = {
 		.inode		= file_inode(vmf->vma->vm_file),
 		.flags		= IOMAP_WRITE | IOMAP_FAULT,
+		.private	= private,
 	};
 	struct folio *folio = page_folio(vmf->page);
 	ssize_t ret;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 9a435b1ff264..2b6d4c71994d 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1498,7 +1498,8 @@ xfs_write_fault(
 	if (IS_DAX(inode))
 		ret = xfs_dax_fault_locked(vmf, order, true);
 	else
-		ret = iomap_page_mkwrite(vmf, &xfs_buffered_write_iomap_ops);
+		ret = iomap_page_mkwrite(vmf, &xfs_buffered_write_iomap_ops,
+				NULL);
 	xfs_iunlock(ip, lock_mode);
 
 	sb_end_pagefault(inode->i_sb);
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 35166c92420c..42e2c0065bb3 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -299,7 +299,7 @@ static vm_fault_t zonefs_filemap_page_mkwrite(struct vm_fault *vmf)
 
 	/* Serialize against truncates */
 	filemap_invalidate_lock_shared(inode->i_mapping);
-	ret = iomap_page_mkwrite(vmf, &zonefs_write_iomap_ops);
+	ret = iomap_page_mkwrite(vmf, &zonefs_write_iomap_ops, NULL);
 	filemap_invalidate_unlock_shared(inode->i_mapping);
 
 	sb_end_pagefault(inode->i_sb);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index f6943c80e5fd..2d1c53024ed1 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -316,9 +316,8 @@ int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
 		bool *did_zero, const struct iomap_ops *ops);
 int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 		const struct iomap_ops *ops);
-vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf,
-			const struct iomap_ops *ops);
-
+vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops,
+		void *private);
 typedef void (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length,
 		struct iomap *iomap);
 void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 7/8] iomap: pass private data to iomap_zero_range
  2024-12-11  8:53 RFC: iomap patches for zoned XFS Christoph Hellwig
                   ` (5 preceding siblings ...)
  2024-12-11  8:53 ` [PATCH 6/8] iomap: pass private data to iomap_page_mkwrite Christoph Hellwig
@ 2024-12-11  8:53 ` Christoph Hellwig
  2024-12-11  8:53 ` [PATCH 8/8] iomap: pass private data to iomap_truncate_page Christoph Hellwig
  2024-12-12 13:29 ` RFC: iomap patches for zoned XFS Brian Foster
  8 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-12-11  8:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Darrick J. Wong, Carlos Maiolino, linux-xfs, linux-fsdevel

Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/gfs2/bmap.c         | 3 ++-
 fs/iomap/buffered-io.c | 6 ++++--
 fs/xfs/xfs_iomap.c     | 2 +-
 include/linux/iomap.h  | 2 +-
 4 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 1795c4e8dbf6..366516b98b3f 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1300,7 +1300,8 @@ static int gfs2_block_zero_range(struct inode *inode, loff_t from,
 				 unsigned int length)
 {
 	BUG_ON(current->journal_info);
-	return iomap_zero_range(inode, from, length, NULL, &gfs2_iomap_ops);
+	return iomap_zero_range(inode, from, length, NULL, &gfs2_iomap_ops,
+			NULL);
 }
 
 #define GFS2_JTRUNC_REVOKES 8192
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 0863c78dc69f..6bfee1c7aedb 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1397,13 +1397,14 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 
 int
 iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
-		const struct iomap_ops *ops)
+		const struct iomap_ops *ops, void *private)
 {
 	struct iomap_iter iter = {
 		.inode		= inode,
 		.pos		= pos,
 		.len		= len,
 		.flags		= IOMAP_ZERO,
+		.private	= private,
 	};
 	struct address_space *mapping = inode->i_mapping;
 	unsigned int blocksize = i_blocksize(inode);
@@ -1471,7 +1472,8 @@ iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 	/* Block boundary? Nothing to do */
 	if (!off)
 		return 0;
-	return iomap_zero_range(inode, pos, blocksize - off, did_zero, ops);
+	return iomap_zero_range(inode, pos, blocksize - off, did_zero, ops,
+			NULL);
 }
 EXPORT_SYMBOL_GPL(iomap_truncate_page);
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 50fa3ef89f6c..3410c55f544a 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1497,7 +1497,7 @@ xfs_zero_range(
 		return dax_zero_range(inode, pos, len, did_zero,
 				      &xfs_dax_write_iomap_ops);
 	return iomap_zero_range(inode, pos, len, did_zero,
-				&xfs_buffered_write_iomap_ops);
+				&xfs_buffered_write_iomap_ops, NULL);
 }
 
 int
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 2d1c53024ed1..2a88dfa6ec55 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -313,7 +313,7 @@ bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
-		bool *did_zero, const struct iomap_ops *ops);
+		bool *did_zero, const struct iomap_ops *ops, void *private);
 int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 		const struct iomap_ops *ops);
 vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops,
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 8/8] iomap: pass private data to iomap_truncate_page
  2024-12-11  8:53 RFC: iomap patches for zoned XFS Christoph Hellwig
                   ` (6 preceding siblings ...)
  2024-12-11  8:53 ` [PATCH 7/8] iomap: pass private data to iomap_zero_range Christoph Hellwig
@ 2024-12-11  8:53 ` Christoph Hellwig
  2024-12-12 19:56   ` Darrick J. Wong
  2024-12-12 13:29 ` RFC: iomap patches for zoned XFS Brian Foster
  8 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2024-12-11  8:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Darrick J. Wong, Carlos Maiolino, linux-xfs, linux-fsdevel

Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 4 ++--
 fs/xfs/xfs_iomap.c     | 2 +-
 include/linux/iomap.h  | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6bfee1c7aedb..ccb2c6cbb18e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1464,7 +1464,7 @@ EXPORT_SYMBOL_GPL(iomap_zero_range);
 
 int
 iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-		const struct iomap_ops *ops)
+		const struct iomap_ops *ops, void *private)
 {
 	unsigned int blocksize = i_blocksize(inode);
 	unsigned int off = pos & (blocksize - 1);
@@ -1473,7 +1473,7 @@ iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 	if (!off)
 		return 0;
 	return iomap_zero_range(inode, pos, blocksize - off, did_zero, ops,
-			NULL);
+			private);
 }
 EXPORT_SYMBOL_GPL(iomap_truncate_page);
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 3410c55f544a..5dd0922fe2d1 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1512,5 +1512,5 @@ xfs_truncate_page(
 		return dax_truncate_page(inode, pos, did_zero,
 					&xfs_dax_write_iomap_ops);
 	return iomap_truncate_page(inode, pos, did_zero,
-				   &xfs_buffered_write_iomap_ops);
+				   &xfs_buffered_write_iomap_ops, NULL);
 }
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 2a88dfa6ec55..19a2554622e6 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -315,7 +315,7 @@ int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
 		bool *did_zero, const struct iomap_ops *ops, void *private);
 int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-		const struct iomap_ops *ops);
+		const struct iomap_ops *ops, void *private);
 vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops,
 		void *private);
 typedef void (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length,
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/8] iomap: split bios to zone append limits in the submission handlers
  2024-12-11  8:53 ` [PATCH 4/8] iomap: split bios to zone append limits in the submission handlers Christoph Hellwig
@ 2024-12-12 13:28   ` Brian Foster
  2024-12-12 15:05     ` Christoph Hellwig
  2024-12-12 14:21   ` John Garry
  2024-12-12 19:51   ` Darrick J. Wong
  2 siblings, 1 reply; 30+ messages in thread
From: Brian Foster @ 2024-12-12 13:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Darrick J. Wong, Carlos Maiolino, linux-xfs,
	linux-fsdevel

On Wed, Dec 11, 2024 at 09:53:44AM +0100, Christoph Hellwig wrote:
> Provide helpers for file systems to split bios in the direct I/O and
> writeback I/O submission handlers.
> 
> This Follows btrfs' lead and don't try to build bios to hardware limits
> for zone append commands, but instead build them as normal unconstrained
> bios and split them to the hardware limits in the I/O submission handler.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap/Makefile      |  1 +
>  fs/iomap/buffered-io.c | 43 ++++++++++++++-----------
>  fs/iomap/ioend.c       | 73 ++++++++++++++++++++++++++++++++++++++++++
>  include/linux/iomap.h  |  9 ++++++
>  4 files changed, 108 insertions(+), 18 deletions(-)
>  create mode 100644 fs/iomap/ioend.c
> 
...
> diff --git a/fs/iomap/ioend.c b/fs/iomap/ioend.c
> new file mode 100644
> index 000000000000..f3d98121c593
> --- /dev/null
> +++ b/fs/iomap/ioend.c
...

It might be useful to add a small comment here to point out this splits
from the front of the ioend (i.e. akin to bio_split()), documents the
params, and maybe mentions the ioend relationship requirements (i.e.
according to bio_split(), the split ioend bio refers to the vectors in
the original ioend bio).

Brian

> +struct iomap_ioend *iomap_split_ioend(struct iomap_ioend *ioend, bool is_append,
> +		unsigned int *alloc_len)
> +{
> +	struct bio *bio = &ioend->io_bio;
> +	struct iomap_ioend *split_ioend;
> +	struct bio *split;
> +	int sector_offset;
> +	unsigned int nr_segs;
> +
> +	if (is_append) {
> +		struct queue_limits *lim = bdev_limits(bio->bi_bdev);
> +
> +		sector_offset = bio_split_rw_at(bio, lim, &nr_segs,
> +			min(lim->max_zone_append_sectors << SECTOR_SHIFT,
> +			    *alloc_len));
> +		if (!sector_offset)
> +			return NULL;
> +	} else {
> +		if (bio->bi_iter.bi_size <= *alloc_len)
> +			return NULL;
> +		sector_offset = *alloc_len >> SECTOR_SHIFT;
> +	}
> +
> +	/* ensure the split ioend is still block size aligned */
> +	sector_offset = ALIGN_DOWN(sector_offset << SECTOR_SHIFT,
> +			i_blocksize(ioend->io_inode)) >> SECTOR_SHIFT;
> +
> +	split = bio_split(bio, sector_offset, GFP_NOFS, &iomap_ioend_bioset);
> +	if (!split)
> +		return NULL;
> +	split->bi_private = bio->bi_private;
> +	split->bi_end_io = bio->bi_end_io;
> +
> +	split_ioend = iomap_init_ioend(ioend->io_inode, split, ioend->io_offset,
> +			ioend->io_flags);
> +	split_ioend->io_parent = ioend;
> +
> +	atomic_inc(&ioend->io_remaining);
> +	ioend->io_offset += split_ioend->io_size;
> +	ioend->io_size -= split_ioend->io_size;
> +
> +	split_ioend->io_sector = ioend->io_sector;
> +	if (!is_append)
> +		ioend->io_sector += (split_ioend->io_size >> SECTOR_SHIFT);
> +
> +	*alloc_len -= split->bi_iter.bi_size;
> +	return split_ioend;
> +}
> +EXPORT_SYMBOL_GPL(iomap_split_ioend);
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 173d490c20ba..eaa8cb9083eb 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -354,6 +354,9 @@ struct iomap_ioend {
>  	struct list_head	io_list;	/* next ioend in chain */
>  	u16			io_flags;	/* IOMAP_IOEND_* */
>  	struct inode		*io_inode;	/* file being written to */
> +	atomic_t		io_remaining;	/* completetion defer count */
> +	int			io_error;	/* stashed away status */
> +	struct iomap_ioend	*io_parent;	/* parent for completions */
>  	size_t			io_size;	/* size of the extent */
>  	loff_t			io_offset;	/* offset in the file */
>  	sector_t		io_sector;	/* start sector of ioend */
> @@ -404,6 +407,10 @@ struct iomap_writepage_ctx {
>  	u32			nr_folios;	/* folios added to the ioend */
>  };
>  
> +struct iomap_ioend *iomap_init_ioend(struct inode *inode, struct bio *bio,
> +		loff_t file_offset, u16 flags);
> +struct iomap_ioend *iomap_split_ioend(struct iomap_ioend *ioend, bool is_append,
> +		unsigned int *alloc_len);
>  void iomap_finish_ioends(struct iomap_ioend *ioend, int error);
>  void iomap_ioend_try_merge(struct iomap_ioend *ioend,
>  		struct list_head *more_ioends);
> @@ -475,4 +482,6 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
>  # define iomap_swapfile_activate(sis, swapfile, pagespan, ops)	(-EIO)
>  #endif /* CONFIG_SWAP */
>  
> +extern struct bio_set iomap_ioend_bioset;
> +
>  #endif /* LINUX_IOMAP_H */
> -- 
> 2.45.2
> 
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/8] iomap: optionally use ioends for direct I/O
  2024-12-11  8:53 ` [PATCH 5/8] iomap: optionally use ioends for direct I/O Christoph Hellwig
@ 2024-12-12 13:29   ` Brian Foster
  2024-12-12 15:12     ` Christoph Hellwig
  2024-12-12 19:56   ` Darrick J. Wong
  1 sibling, 1 reply; 30+ messages in thread
From: Brian Foster @ 2024-12-12 13:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Darrick J. Wong, Carlos Maiolino, linux-xfs,
	linux-fsdevel

On Wed, Dec 11, 2024 at 09:53:45AM +0100, Christoph Hellwig wrote:
> struct iomap_ioend currently tracks outstanding buffered writes and has
> some really nice code in core iomap and XFS to merge contiguous I/Os
> an defer them to userspace for completion in a very efficient way.
> 
> For zoned writes we'll also need a per-bio user context completion to
> record the written blocks, and the infrastructure for that would look
> basically like the ioend handling for buffered I/O.
> 
> So instead of reinventing the wheel, reuse the existing infrastructure.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap/buffered-io.c |  3 +++
>  fs/iomap/direct-io.c   | 50 +++++++++++++++++++++++++++++++++++++++++-
>  fs/iomap/internal.h    |  7 ++++++
>  include/linux/iomap.h  |  4 +++-
>  4 files changed, 62 insertions(+), 2 deletions(-)
>  create mode 100644 fs/iomap/internal.h
> 
...
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index b521eb15759e..b5466361cafe 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
...
> @@ -163,6 +166,51 @@ static inline void iomap_dio_set_error(struct iomap_dio *dio, int ret)
>  	cmpxchg(&dio->error, 0, ret);
>  }
>  
> +u32 iomap_finish_ioend_direct(struct iomap_ioend *ioend)
> +{
> +	struct iomap_dio *dio = ioend->io_bio.bi_private;
> +	bool should_dirty = (dio->flags & IOMAP_DIO_DIRTY);
> +	struct kiocb *iocb = dio->iocb;
> +	u32 vec_count = ioend->io_bio.bi_vcnt;
> +
> +	if (ioend->io_error)
> +		iomap_dio_set_error(dio, ioend->io_error);
> +
> +	if (atomic_dec_and_test(&dio->ref)) {
> +		struct inode *inode = file_inode(iocb->ki_filp);
> +
> +		if (dio->wait_for_completion) {
> +			struct task_struct *waiter = dio->submit.waiter;
> +
> +			WRITE_ONCE(dio->submit.waiter, NULL);
> +			blk_wake_io_task(waiter);
> +		} else if (!inode->i_mapping->nrpages) {
> +			WRITE_ONCE(iocb->private, NULL);
> +
> +			/*
> +			 * We must never invalidate pages from this thread to
> +			 * avoid deadlocks with buffered I/O completions.
> +			 * Tough luck if you hit the tiny race with someone
> +			 * dirtying the range now.
> +			 */
> +			dio->flags |= IOMAP_DIO_NO_INVALIDATE;
> +			iomap_dio_complete_work(&dio->aio.work);
> +		} else {
> +			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> +			queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> +		}
> +	}
> +
> +	if (should_dirty) {
> +		bio_check_pages_dirty(&ioend->io_bio);
> +	} else {
> +		bio_release_pages(&ioend->io_bio, false);
> +		bio_put(&ioend->io_bio);
> +	}
> +

Not that it matters all that much, but I'm a little curious about the
reasoning for using vec_count here. AFAICS this correlates to per-folio
writeback completions for buffered I/O, but that doesn't seem to apply
to direct I/O. Is there a reason to have the caller throttle based on
vec_counts, or are we just pulling some non-zero value for consistency
sake?

Brian

> +	return vec_count;
> +}
> +
>  void iomap_dio_bio_end_io(struct bio *bio)
>  {
>  	struct iomap_dio *dio = bio->bi_private;
> diff --git a/fs/iomap/internal.h b/fs/iomap/internal.h
> new file mode 100644
> index 000000000000..20cccfc3bb13
> --- /dev/null
> +++ b/fs/iomap/internal.h
> @@ -0,0 +1,7 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _IOMAP_INTERNAL_H
> +#define _IOMAP_INTERNAL_H 1
> +
> +u32 iomap_finish_ioend_direct(struct iomap_ioend *ioend);
> +
> +#endif /* _IOMAP_INTERNAL_H */
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index eaa8cb9083eb..f6943c80e5fd 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -343,9 +343,11 @@ sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
>  #define IOMAP_IOEND_UNWRITTEN		(1U << 1)
>  /* don't merge into previous ioend */
>  #define IOMAP_IOEND_BOUNDARY		(1U << 2)
> +/* is direct I/O */
> +#define IOMAP_IOEND_DIRECT		(1U << 3)
>  
>  #define IOMAP_IOEND_NOMERGE_FLAGS \
> -	(IOMAP_IOEND_SHARED | IOMAP_IOEND_UNWRITTEN)
> +	(IOMAP_IOEND_SHARED | IOMAP_IOEND_UNWRITTEN | IOMAP_IOEND_DIRECT)
>  
>  /*
>   * Structure for writeback I/O completions.
> -- 
> 2.45.2
> 
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: RFC: iomap patches for zoned XFS
  2024-12-11  8:53 RFC: iomap patches for zoned XFS Christoph Hellwig
                   ` (7 preceding siblings ...)
  2024-12-11  8:53 ` [PATCH 8/8] iomap: pass private data to iomap_truncate_page Christoph Hellwig
@ 2024-12-12 13:29 ` Brian Foster
  8 siblings, 0 replies; 30+ messages in thread
From: Brian Foster @ 2024-12-12 13:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Darrick J. Wong, Carlos Maiolino, linux-xfs,
	linux-fsdevel

On Wed, Dec 11, 2024 at 09:53:40AM +0100, Christoph Hellwig wrote:
> Hi all,
> 
> this series contains the iomap prep work to support zoned XFS.
> 
> The biggest changes are:
> 
>  - an option to reuse the ioend code for direct writes in addition to the
>    current use for buffered writeback, which allows the file system to
>    track completions on a per-bio basis instead of the current end_io
>    callback which operates on the entire I/O.
>    Note that it might make sense to split the ioend code from
>    buffered-io.c into its own file with this.  Let me know what you think
>    of that and I can include it in the next version
>  - change of the writeback_ops so that the submit_bio call can be done by
>    the file system.  Note that btrfs will also need this eventually when
>    it starts using iomap
>  - helpers to split ioend to the zone append queue_limits that plug
>    into the previous item above.
>  - a bunch of changes for slightly different merge conditions when using
>    zone append.  Note that btrfs wants something similar also for
>    compressed I/O, which might be able to share some code.  For now
>    the flags use zone append naming, but we can change that if it gets
>    used elsewhere.
>  - passing private data to a few more helper
> 
> The XFS changes to use this will be posted to the xfs list only to not
> spam fsdevel too much.
> 

Orthogonal to the couple or so questions inline, the series LGTM:

Reviewed-by: Brian Foster <bfoster@redhat.com>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/8] iomap: split bios to zone append limits in the submission handlers
  2024-12-11  8:53 ` [PATCH 4/8] iomap: split bios to zone append limits in the submission handlers Christoph Hellwig
  2024-12-12 13:28   ` Brian Foster
@ 2024-12-12 14:21   ` John Garry
  2024-12-12 15:07     ` Christoph Hellwig
  2024-12-12 19:51   ` Darrick J. Wong
  2 siblings, 1 reply; 30+ messages in thread
From: John Garry @ 2024-12-12 14:21 UTC (permalink / raw)
  To: Christoph Hellwig, Christian Brauner
  Cc: Darrick J. Wong, Carlos Maiolino, linux-xfs, linux-fsdevel

On 11/12/2024 08:53, Christoph Hellwig wrote:
> +	if (is_append) {
> +		struct queue_limits *lim = bdev_limits(bio->bi_bdev);
> +
> +		sector_offset = bio_split_rw_at(bio, lim, &nr_segs,
> +			min(lim->max_zone_append_sectors << SECTOR_SHIFT,
> +			    *alloc_len));
> +		if (!sector_offset)

Should this be:

		if (sector_offset <= 0)

> +			return NULL;
> +	} else {
> +		if (bio->bi_iter.bi_size <= *alloc_len)
> +			return NULL;
> +		sector_offset = *alloc_len >> SECTOR_SHIFT;
> +	}
> +
> +	/* ensure the split ioend is still block size aligned */
> +	sector_offset = ALIGN_DOWN(sector_offset << SECTOR_SHIFT,
> +			i_blocksize(ioend->io_inode)) >> SECTOR_SHIFT;


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/8] iomap: split bios to zone append limits in the submission handlers
  2024-12-12 13:28   ` Brian Foster
@ 2024-12-12 15:05     ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-12-12 15:05 UTC (permalink / raw)
  To: Brian Foster
  Cc: Christoph Hellwig, Christian Brauner, Darrick J. Wong,
	Carlos Maiolino, linux-xfs, linux-fsdevel

On Thu, Dec 12, 2024 at 08:28:17AM -0500, Brian Foster wrote:
> It might be useful to add a small comment here to point out this splits
> from the front of the ioend (i.e. akin to bio_split()), documents the
> params, and maybe mentions the ioend relationship requirements (i.e.
> according to bio_split(), the split ioend bio refers to the vectors in
> the original ioend bio).

Sure, I can add that.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/8] iomap: split bios to zone append limits in the submission handlers
  2024-12-12 14:21   ` John Garry
@ 2024-12-12 15:07     ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-12-12 15:07 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, Christian Brauner, Darrick J. Wong,
	Carlos Maiolino, linux-xfs, linux-fsdevel

On Thu, Dec 12, 2024 at 02:21:32PM +0000, John Garry wrote:
> On 11/12/2024 08:53, Christoph Hellwig wrote:
>> +	if (is_append) {
>> +		struct queue_limits *lim = bdev_limits(bio->bi_bdev);
>> +
>> +		sector_offset = bio_split_rw_at(bio, lim, &nr_segs,
>> +			min(lim->max_zone_append_sectors << SECTOR_SHIFT,
>> +			    *alloc_len));
>> +		if (!sector_offset)
>
> Should this be:
>
> 		if (sector_offset <= 0)

No support for REQ_ATOMIC and REQ_NOWAIT in this path right now,
but we might as well future prove it by checking for a negative
error value.  But we'll then need to propagate the error as well.
I'll see what I can do.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/8] iomap: optionally use ioends for direct I/O
  2024-12-12 13:29   ` Brian Foster
@ 2024-12-12 15:12     ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-12-12 15:12 UTC (permalink / raw)
  To: Brian Foster
  Cc: Christoph Hellwig, Christian Brauner, Darrick J. Wong,
	Carlos Maiolino, linux-xfs, linux-fsdevel

On Thu, Dec 12, 2024 at 08:29:36AM -0500, Brian Foster wrote:
> > +	bool should_dirty = (dio->flags & IOMAP_DIO_DIRTY);
> > +	struct kiocb *iocb = dio->iocb;
> > +	u32 vec_count = ioend->io_bio.bi_vcnt;
> > +
> > +	if (ioend->io_error)
> > +		iomap_dio_set_error(dio, ioend->io_error);
> > +
> > +	if (atomic_dec_and_test(&dio->ref)) {
> > +		struct inode *inode = file_inode(iocb->ki_filp);
> > +
> > +		if (dio->wait_for_completion) {
> > +			struct task_struct *waiter = dio->submit.waiter;
> > +
> > +			WRITE_ONCE(dio->submit.waiter, NULL);
> > +			blk_wake_io_task(waiter);
> > +		} else if (!inode->i_mapping->nrpages) {
> > +			WRITE_ONCE(iocb->private, NULL);
> > +
> > +			/*
> > +			 * We must never invalidate pages from this thread to
> > +			 * avoid deadlocks with buffered I/O completions.
> > +			 * Tough luck if you hit the tiny race with someone
> > +			 * dirtying the range now.
> > +			 */
> > +			dio->flags |= IOMAP_DIO_NO_INVALIDATE;
> > +			iomap_dio_complete_work(&dio->aio.work);
> > +		} else {
> > +			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> > +			queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> > +		}
> > +	}
> > +
> > +	if (should_dirty) {
> > +		bio_check_pages_dirty(&ioend->io_bio);
> > +	} else {
> > +		bio_release_pages(&ioend->io_bio, false);
> > +		bio_put(&ioend->io_bio);
> > +	}
> > +
> 
> Not that it matters all that much, but I'm a little curious about the
> reasoning for using vec_count here. AFAICS this correlates to per-folio
> writeback completions for buffered I/O, but that doesn't seem to apply
> to direct I/O. Is there a reason to have the caller throttle based on
> vec_counts, or are we just pulling some non-zero value for consistency
> sake?

So direct I/O also iterates over all folios for the bio, to unpin,
and in case of reads dirty all of them.

I wanted to plug something useful into cond_resched condition in the
caller.  Now number of bvecs isn't the number of folios as we can
physically merge outside the folio context, but I think this is about
as goot as it gets without changing the block code to return the
number of folios processed from __bio_release_pages and
bio_check_pages_dirty.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/8] iomap: allow the file system to submit the writeback bios
  2024-12-11  8:53 ` [PATCH 1/8] iomap: allow the file system to submit the writeback bios Christoph Hellwig
@ 2024-12-12 17:48   ` Darrick J. Wong
  0 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2024-12-12 17:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Carlos Maiolino, linux-xfs, linux-fsdevel

On Wed, Dec 11, 2024 at 09:53:41AM +0100, Christoph Hellwig wrote:
> Change ->prepare_ioend to ->submit_ioend and require file systems that
> implement it to submit the bio.  This is needed for file systems that
> do their own work on the bios before submitting them to the block layer
> like btrfs or zoned xfs.  To make this easier also pass the writeback
> context to the method.

The code changes here are pretty straightforward, but please update
Documentation/filesystems/iomap/operations.rst to reflect the new name
and the new "submit the bio yourself" behavior expected of the
implementation.

--D

> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap/buffered-io.c | 10 +++++-----
>  fs/xfs/xfs_aops.c      | 13 +++++++++----
>  include/linux/iomap.h  | 12 +++++++-----
>  3 files changed, 21 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 955f19e27e47..cdccf11bb3be 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1675,7 +1675,7 @@ static void iomap_writepage_end_bio(struct bio *bio)
>  }
>  
>  /*
> - * Submit the final bio for an ioend.
> + * Submit an ioend.
>   *
>   * If @error is non-zero, it means that we have a situation where some part of
>   * the submission process has failed after we've marked pages for writeback.
> @@ -1694,14 +1694,14 @@ static int iomap_submit_ioend(struct iomap_writepage_ctx *wpc, int error)
>  	 * failure happened so that the file system end I/O handler gets called
>  	 * to clean up.
>  	 */
> -	if (wpc->ops->prepare_ioend)
> -		error = wpc->ops->prepare_ioend(wpc->ioend, error);
> +	if (wpc->ops->submit_ioend)
> +		error = wpc->ops->submit_ioend(wpc, error);
> +	else if (!error)
> +		submit_bio(&wpc->ioend->io_bio);
>  
>  	if (error) {
>  		wpc->ioend->io_bio.bi_status = errno_to_blk_status(error);
>  		bio_endio(&wpc->ioend->io_bio);
> -	} else {
> -		submit_bio(&wpc->ioend->io_bio);
>  	}
>  
>  	wpc->ioend = NULL;
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 559a3a577097..d175853da5ae 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -395,10 +395,11 @@ xfs_map_blocks(
>  }
>  
>  static int
> -xfs_prepare_ioend(
> -	struct iomap_ioend	*ioend,
> +xfs_submit_ioend(
> +	struct iomap_writepage_ctx *wpc,
>  	int			status)
>  {
> +	struct iomap_ioend	*ioend = wpc->ioend;
>  	unsigned int		nofs_flag;
>  
>  	/*
> @@ -420,7 +421,11 @@ xfs_prepare_ioend(
>  	if (xfs_ioend_is_append(ioend) || ioend->io_type == IOMAP_UNWRITTEN ||
>  	    (ioend->io_flags & IOMAP_F_SHARED))
>  		ioend->io_bio.bi_end_io = xfs_end_bio;
> -	return status;
> +
> +	if (status)
> +		return status;
> +	submit_bio(&ioend->io_bio);
> +	return 0;
>  }
>  
>  /*
> @@ -462,7 +467,7 @@ xfs_discard_folio(
>  
>  static const struct iomap_writeback_ops xfs_writeback_ops = {
>  	.map_blocks		= xfs_map_blocks,
> -	.prepare_ioend		= xfs_prepare_ioend,
> +	.submit_ioend		= xfs_submit_ioend,
>  	.discard_folio		= xfs_discard_folio,
>  };
>  
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 5675af6b740c..c0339678d798 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -362,12 +362,14 @@ struct iomap_writeback_ops {
>  			  loff_t offset, unsigned len);
>  
>  	/*
> -	 * Optional, allows the file systems to perform actions just before
> -	 * submitting the bio and/or override the bio end_io handler for complex
> -	 * operations like copy on write extent manipulation or unwritten extent
> -	 * conversions.
> +	 * Optional, allows the file systems to hook into bio submission,
> +	 * including overriding the bi_end_io handler.
> +	 *
> +	 * Returns 0 if the bio was successfully submitted, or a negative
> +	 * error code if status was non-zero or another error happened and
> +	 * the bio could not be submitted.
>  	 */
> -	int (*prepare_ioend)(struct iomap_ioend *ioend, int status);
> +	int (*submit_ioend)(struct iomap_writepage_ctx *wpc, int status);
>  
>  	/*
>  	 * Optional, allows the file system to discard state on a page where
> -- 
> 2.45.2
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/8] iomap: simplify io_flags and io_type in struct iomap_ioend
  2024-12-11  8:53 ` [PATCH 2/8] iomap: simplify io_flags and io_type in struct iomap_ioend Christoph Hellwig
@ 2024-12-12 17:55   ` Darrick J. Wong
  2024-12-13  4:53     ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2024-12-12 17:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Carlos Maiolino, linux-xfs, linux-fsdevel

On Wed, Dec 11, 2024 at 09:53:42AM +0100, Christoph Hellwig wrote:
> The ioend fields for distinct types of I/O are a bit complicated.
> Consolidate them into a single io_flag field with it's own flags
> decoupled from the iomap flags.  This also prepares for adding a new
> flag that is unrelated to both of the iomap namespaces.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap/buffered-io.c | 39 ++++++++++++++++++++++-----------------
>  fs/xfs/xfs_aops.c      | 12 ++++++------
>  include/linux/iomap.h  | 16 ++++++++++++++--
>  3 files changed, 42 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index cdccf11bb3be..3176dc996fb7 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1605,13 +1605,10 @@ iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next)
>  {
>  	if (ioend->io_bio.bi_status != next->io_bio.bi_status)
>  		return false;
> -	if (next->io_flags & IOMAP_F_BOUNDARY)
> +	if (next->io_flags & IOMAP_IOEND_BOUNDARY)
>  		return false;
> -	if ((ioend->io_flags & IOMAP_F_SHARED) ^
> -	    (next->io_flags & IOMAP_F_SHARED))
> -		return false;
> -	if ((ioend->io_type == IOMAP_UNWRITTEN) ^
> -	    (next->io_type == IOMAP_UNWRITTEN))
> +	if ((ioend->io_flags & IOMAP_IOEND_NOMERGE_FLAGS) !=
> +	    (next->io_flags & IOMAP_IOEND_NOMERGE_FLAGS))
>  		return false;
>  	if (ioend->io_offset + ioend->io_size != next->io_offset)
>  		return false;
> @@ -1709,7 +1706,8 @@ static int iomap_submit_ioend(struct iomap_writepage_ctx *wpc, int error)
>  }
>  
>  static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
> -		struct writeback_control *wbc, struct inode *inode, loff_t pos)
> +		struct writeback_control *wbc, struct inode *inode, loff_t pos,
> +		u16 ioend_flags)
>  {
>  	struct iomap_ioend *ioend;
>  	struct bio *bio;
> @@ -1724,8 +1722,7 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
>  
>  	ioend = iomap_ioend_from_bio(bio);
>  	INIT_LIST_HEAD(&ioend->io_list);
> -	ioend->io_type = wpc->iomap.type;
> -	ioend->io_flags = wpc->iomap.flags;
> +	ioend->io_flags = ioend_flags;
>  	if (pos > wpc->iomap.offset)
>  		wpc->iomap.flags &= ~IOMAP_F_BOUNDARY;
>  	ioend->io_inode = inode;
> @@ -1737,14 +1734,13 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
>  	return ioend;
>  }
>  
> -static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos)
> +static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos,
> +		u16 ioend_flags)
>  {
> -	if (wpc->iomap.offset == pos && (wpc->iomap.flags & IOMAP_F_BOUNDARY))
> -		return false;
> -	if ((wpc->iomap.flags & IOMAP_F_SHARED) !=
> -	    (wpc->ioend->io_flags & IOMAP_F_SHARED))
> +	if (ioend_flags & IOMAP_IOEND_BOUNDARY)
>  		return false;
> -	if (wpc->iomap.type != wpc->ioend->io_type)
> +	if ((ioend_flags & IOMAP_IOEND_NOMERGE_FLAGS) !=
> +	    (wpc->ioend->io_flags & IOMAP_IOEND_NOMERGE_FLAGS))
>  		return false;
>  	if (pos != wpc->ioend->io_offset + wpc->ioend->io_size)
>  		return false;
> @@ -1778,14 +1774,23 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
>  {
>  	struct iomap_folio_state *ifs = folio->private;
>  	size_t poff = offset_in_folio(folio, pos);
> +	unsigned int ioend_flags = 0;
>  	int error;
>  
> -	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) {
> +	if (wpc->iomap.type == IOMAP_UNWRITTEN)
> +		ioend_flags |= IOMAP_IOEND_UNWRITTEN;
> +	if (wpc->iomap.flags & IOMAP_F_SHARED)
> +		ioend_flags |= IOMAP_IOEND_SHARED;
> +	if (pos == wpc->iomap.offset && (wpc->iomap.flags & IOMAP_F_BOUNDARY))
> +		ioend_flags |= IOMAP_IOEND_BOUNDARY;
> +
> +	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos, ioend_flags)) {
>  new_ioend:
>  		error = iomap_submit_ioend(wpc, 0);
>  		if (error)
>  			return error;
> -		wpc->ioend = iomap_alloc_ioend(wpc, wbc, inode, pos);
> +		wpc->ioend = iomap_alloc_ioend(wpc, wbc, inode, pos,
> +				ioend_flags);
>  	}
>  
>  	if (!bio_add_folio(&wpc->ioend->io_bio, folio, len, poff))
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index d175853da5ae..d35ac4c19fb2 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -114,7 +114,7 @@ xfs_end_ioend(
>  	 */
>  	error = blk_status_to_errno(ioend->io_bio.bi_status);
>  	if (unlikely(error)) {
> -		if (ioend->io_flags & IOMAP_F_SHARED) {
> +		if (ioend->io_flags & IOMAP_IOEND_SHARED) {
>  			xfs_reflink_cancel_cow_range(ip, offset, size, true);
>  			xfs_bmap_punch_delalloc_range(ip, XFS_DATA_FORK, offset,
>  					offset + size);
> @@ -125,9 +125,9 @@ xfs_end_ioend(
>  	/*
>  	 * Success: commit the COW or unwritten blocks if needed.
>  	 */
> -	if (ioend->io_flags & IOMAP_F_SHARED)
> +	if (ioend->io_flags & IOMAP_IOEND_SHARED)
>  		error = xfs_reflink_end_cow(ip, offset, size);
> -	else if (ioend->io_type == IOMAP_UNWRITTEN)
> +	else if (ioend->io_flags & IOMAP_IOEND_UNWRITTEN)
>  		error = xfs_iomap_write_unwritten(ip, offset, size, false);
>  
>  	if (!error && xfs_ioend_is_append(ioend))
> @@ -410,7 +410,7 @@ xfs_submit_ioend(
>  	nofs_flag = memalloc_nofs_save();
>  
>  	/* Convert CoW extents to regular */
> -	if (!status && (ioend->io_flags & IOMAP_F_SHARED)) {
> +	if (!status && (ioend->io_flags & IOMAP_IOEND_SHARED)) {
>  		status = xfs_reflink_convert_cow(XFS_I(ioend->io_inode),
>  				ioend->io_offset, ioend->io_size);
>  	}
> @@ -418,8 +418,8 @@ xfs_submit_ioend(
>  	memalloc_nofs_restore(nofs_flag);
>  
>  	/* send ioends that might require a transaction to the completion wq */
> -	if (xfs_ioend_is_append(ioend) || ioend->io_type == IOMAP_UNWRITTEN ||
> -	    (ioend->io_flags & IOMAP_F_SHARED))
> +	if (xfs_ioend_is_append(ioend) ||
> +	    (ioend->io_flags & (IOMAP_IOEND_UNWRITTEN | IOMAP_IOEND_SHARED)))
>  		ioend->io_bio.bi_end_io = xfs_end_bio;
>  
>  	if (status)
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index c0339678d798..1d8658c7beb8 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -327,13 +327,25 @@ loff_t iomap_seek_data(struct inode *inode, loff_t offset,
>  sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
>  		const struct iomap_ops *ops);
>  
> +/*
> + * Flags for iomap_ioend->io_flags.
> + */
> +/* shared COW extent */
> +#define IOMAP_IOEND_SHARED		(1U << 0)
> +/* unwritten extent */
> +#define IOMAP_IOEND_UNWRITTEN		(1U << 1)
> +/* don't merge into previous ioend */
> +#define IOMAP_IOEND_BOUNDARY		(1U << 2)
> +
> +#define IOMAP_IOEND_NOMERGE_FLAGS \
> +	(IOMAP_IOEND_SHARED | IOMAP_IOEND_UNWRITTEN)

Hmm.  At first I wondered "Why wouldn't BOUNDARY be in here too?  It
also prevents merging of ioends."  Then I remembered that BOUNDARY is an
explicit nomerge flag, whereas what NOMERGE_FLAGS provides is that we
always split ioends whenever the ioend work changes.

How about a comment?

/* split ioends when the type of completion work changes */
#define IOMAP_IOEND_NOMERGE_FLAGS \
	(IOMAP_IOEND_SHARED | IOMAP_IOEND_UNWRITTEN)

Otherwise this looks fine to me.

--D

> +
>  /*
>   * Structure for writeback I/O completions.
>   */
>  struct iomap_ioend {
>  	struct list_head	io_list;	/* next ioend in chain */
> -	u16			io_type;
> -	u16			io_flags;	/* IOMAP_F_* */
> +	u16			io_flags;	/* IOMAP_IOEND_* */
>  	struct inode		*io_inode;	/* file being written to */
>  	size_t			io_size;	/* size of the extent */
>  	loff_t			io_offset;	/* offset in the file */
> -- 
> 2.45.2
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/8] iomap: add a IOMAP_F_ZONE_APPEND flag
  2024-12-11  8:53 ` [PATCH 3/8] iomap: add a IOMAP_F_ZONE_APPEND flag Christoph Hellwig
@ 2024-12-12 18:05   ` Darrick J. Wong
  2024-12-13  4:55     ` Christoph Hellwig
  2024-12-16  4:55     ` Christoph Hellwig
  0 siblings, 2 replies; 30+ messages in thread
From: Darrick J. Wong @ 2024-12-12 18:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Carlos Maiolino, linux-xfs, linux-fsdevel

On Wed, Dec 11, 2024 at 09:53:43AM +0100, Christoph Hellwig wrote:
> This doesn't much - just always returns the start block number for each

"This doesn't much" - I don't understand the sentence very well.  How
about:

"Add a new IOMAP_F_ZONE_APPEND flag for the filesystem to indicate that
the storage device must inform us where it wrote the data, so that the
filesystem can update its own internal mapping metadata.  The filesystem
should set the starting address of the zone in iomap::addr, and extract
the LBA address from the bio during ioend processing.  iomap builds
bios unconstrained by the hardware limits and will split them in the bio
submission handler."

The splitting happens whenever IOMAP_F_BOUNDARY gets set by
->map_blocks, right?

> iomap instead of increasing it.  This is because we'll keep building bios
> unconstrained by the hardware limits and just split them in file system
> submission handler.
> 
> Maybe we should find another name for it, because it might be useful for
> btrfs compressed bio submissions as well, but I can't come up with a
> good one.

Since you have to tell the device the starting LBA of the zone you want
to write to, I think _ZONE_APPEND is a reasonable name.  The code looks
correct though.

> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap/buffered-io.c | 19 ++++++++++++++++---
>  include/linux/iomap.h  |  7 +++++++
>  2 files changed, 23 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 3176dc996fb7..129cd96c6c96 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1744,9 +1744,22 @@ static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos,
>  		return false;
>  	if (pos != wpc->ioend->io_offset + wpc->ioend->io_size)
>  		return false;
> -	if (iomap_sector(&wpc->iomap, pos) !=
> -	    bio_end_sector(&wpc->ioend->io_bio))
> -		return false;
> +	if (wpc->iomap.flags & IOMAP_F_ZONE_APPEND) {
> +		/*
> +		 * For Zone Append command, bi_sector points to the zone start
> +		 * before submission.  We can merge all I/O for the same zone.
> +		 */
> +		if (iomap_sector(&wpc->iomap, pos) !=
> +		    wpc->ioend->io_bio.bi_iter.bi_sector)
> +			return false;
> +	} else {
> +		/*
> +		 * For regular writes, the disk blocks needs to be contiguous.
> +		 */
> +		if (iomap_sector(&wpc->iomap, pos) !=
> +		    bio_end_sector(&wpc->ioend->io_bio))
> +			return false;
> +	}
>  	/*
>  	 * Limit ioend bio chain lengths to minimise IO completion latency. This
>  	 * also prevents long tight loops ending page writeback on all the
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 1d8658c7beb8..173d490c20ba 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -56,6 +56,10 @@ struct vm_fault;
>   *
>   * IOMAP_F_BOUNDARY indicates that I/O and I/O completions for this iomap must
>   * never be merged with the mapping before it.
> + *
> + * IOMAP_F_ZONE_APPEND indicates that (write) I/O should be done as a zone
> + * append command for zoned devices.  Note that the file system needs to
> + * override the bi_end_io handler to record the actual written sector.
>   */
>  #define IOMAP_F_NEW		(1U << 0)
>  #define IOMAP_F_DIRTY		(1U << 1)
> @@ -68,6 +72,7 @@ struct vm_fault;
>  #endif /* CONFIG_BUFFER_HEAD */
>  #define IOMAP_F_XATTR		(1U << 5)
>  #define IOMAP_F_BOUNDARY	(1U << 6)
> +#define IOMAP_F_ZONE_APPEND	(1U << 7)

Needs a corresponding update in Documentation/iomap/ before we merge
this series.

--D

>  
>  /*
>   * Flags set by the core iomap code during operations:
> @@ -111,6 +116,8 @@ struct iomap {
>  
>  static inline sector_t iomap_sector(const struct iomap *iomap, loff_t pos)
>  {
> +	if (iomap->flags & IOMAP_F_ZONE_APPEND)
> +		return iomap->addr >> SECTOR_SHIFT;
>  	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
>  }
>  
> -- 
> 2.45.2
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/8] iomap: split bios to zone append limits in the submission handlers
  2024-12-11  8:53 ` [PATCH 4/8] iomap: split bios to zone append limits in the submission handlers Christoph Hellwig
  2024-12-12 13:28   ` Brian Foster
  2024-12-12 14:21   ` John Garry
@ 2024-12-12 19:51   ` Darrick J. Wong
  2024-12-13  4:50     ` Christoph Hellwig
  2 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2024-12-12 19:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Carlos Maiolino, linux-xfs, linux-fsdevel

On Wed, Dec 11, 2024 at 09:53:44AM +0100, Christoph Hellwig wrote:
> Provide helpers for file systems to split bios in the direct I/O and
> writeback I/O submission handlers.
> 
> This Follows btrfs' lead and don't try to build bios to hardware limits
> for zone append commands, but instead build them as normal unconstrained
> bios and split them to the hardware limits in the I/O submission handler.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap/Makefile      |  1 +
>  fs/iomap/buffered-io.c | 43 ++++++++++++++-----------
>  fs/iomap/ioend.c       | 73 ++++++++++++++++++++++++++++++++++++++++++
>  include/linux/iomap.h  |  9 ++++++
>  4 files changed, 108 insertions(+), 18 deletions(-)
>  create mode 100644 fs/iomap/ioend.c
> 
> diff --git a/fs/iomap/Makefile b/fs/iomap/Makefile
> index 381d76c5c232..69e8ebb41302 100644
> --- a/fs/iomap/Makefile
> +++ b/fs/iomap/Makefile
> @@ -12,6 +12,7 @@ iomap-y				+= trace.o \
>  				   iter.o
>  iomap-$(CONFIG_BLOCK)		+= buffered-io.o \
>  				   direct-io.o \
> +				   ioend.o \
>  				   fiemap.o \
>  				   seek.o
>  iomap-$(CONFIG_SWAP)		+= swapfile.o
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 129cd96c6c96..8125f758a99d 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -40,7 +40,8 @@ struct iomap_folio_state {
>  	unsigned long		state[];
>  };
>  
> -static struct bio_set iomap_ioend_bioset;
> +struct bio_set iomap_ioend_bioset;
> +EXPORT_SYMBOL_GPL(iomap_ioend_bioset);
>  
>  static inline bool ifs_is_fully_uptodate(struct folio *folio,
>  		struct iomap_folio_state *ifs)
> @@ -1539,15 +1540,15 @@ static void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
>   * ioend after this.
>   */
>  static u32
> -iomap_finish_ioend(struct iomap_ioend *ioend, int error)
> +iomap_finish_ioend_buffered(struct iomap_ioend *ioend)
>  {
>  	struct inode *inode = ioend->io_inode;
>  	struct bio *bio = &ioend->io_bio;
>  	struct folio_iter fi;
>  	u32 folio_count = 0;
>  
> -	if (error) {
> -		mapping_set_error(inode->i_mapping, error);
> +	if (ioend->io_error) {
> +		mapping_set_error(inode->i_mapping, ioend->io_error);
>  		if (!bio_flagged(bio, BIO_QUIET)) {
>  			pr_err_ratelimited(
>  "%s: writeback error on inode %lu, offset %lld, sector %llu",
> @@ -1566,6 +1567,24 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
>  	return folio_count;
>  }
>  
> +static u32
> +iomap_finish_ioend(struct iomap_ioend *ioend, int error)
> +{
> +	if (ioend->io_parent) {
> +		struct bio *bio = &ioend->io_bio;
> +
> +		ioend = ioend->io_parent;
> +		bio_put(bio);
> +	}
> +
> +	if (error)
> +		cmpxchg(&ioend->io_error, 0, error);
> +
> +	if (!atomic_dec_and_test(&ioend->io_remaining))
> +		return 0;
> +	return iomap_finish_ioend_buffered(ioend);
> +}
> +
>  /*
>   * Ioend completion routine for merged bios. This can only be called from task
>   * contexts as merged ioends can be of unbound length. Hence we have to break up
> @@ -1709,7 +1728,6 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
>  		struct writeback_control *wbc, struct inode *inode, loff_t pos,
>  		u16 ioend_flags)
>  {
> -	struct iomap_ioend *ioend;
>  	struct bio *bio;
>  
>  	bio = bio_alloc_bioset(wpc->iomap.bdev, BIO_MAX_VECS,
> @@ -1717,21 +1735,10 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
>  			       GFP_NOFS, &iomap_ioend_bioset);
>  	bio->bi_iter.bi_sector = iomap_sector(&wpc->iomap, pos);
>  	bio->bi_end_io = iomap_writepage_end_bio;
> -	wbc_init_bio(wbc, bio);
>  	bio->bi_write_hint = inode->i_write_hint;
> -
> -	ioend = iomap_ioend_from_bio(bio);
> -	INIT_LIST_HEAD(&ioend->io_list);
> -	ioend->io_flags = ioend_flags;
> -	if (pos > wpc->iomap.offset)
> -		wpc->iomap.flags &= ~IOMAP_F_BOUNDARY;
> -	ioend->io_inode = inode;
> -	ioend->io_size = 0;
> -	ioend->io_offset = pos;
> -	ioend->io_sector = bio->bi_iter.bi_sector;
> -
> +	wbc_init_bio(wbc, bio);
>  	wpc->nr_folios = 0;
> -	return ioend;
> +	return iomap_init_ioend(inode, bio, pos, ioend_flags);
>  }
>  
>  static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos,
> diff --git a/fs/iomap/ioend.c b/fs/iomap/ioend.c
> new file mode 100644
> index 000000000000..f3d98121c593
> --- /dev/null
> +++ b/fs/iomap/ioend.c
> @@ -0,0 +1,73 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2024 Christoph Hellwig.
> + */
> +#include <linux/iomap.h>
> +
> +struct iomap_ioend *iomap_init_ioend(struct inode *inode,
> +		struct bio *bio, loff_t file_offset, u16 flags)
> +{

Nit: s/flags/ioend_flags/ to be consistent with the previous few
patches.

> +	struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
> +
> +	atomic_set(&ioend->io_remaining, 1);
> +	ioend->io_error = 0;
> +	ioend->io_parent = NULL;
> +	INIT_LIST_HEAD(&ioend->io_list);
> +	ioend->io_flags = flags;
> +	ioend->io_inode = inode;
> +	ioend->io_offset = file_offset;
> +	ioend->io_size = bio->bi_iter.bi_size;
> +	ioend->io_sector = bio->bi_iter.bi_sector;
> +	return ioend;
> +}
> +EXPORT_SYMBOL_GPL(iomap_init_ioend);
> +
> +struct iomap_ioend *iomap_split_ioend(struct iomap_ioend *ioend, bool is_append,

Can you determine is_append from (ioend->io_flags & ZONE_APPEND)?

Also it's not clear to me what the initial and output state of
*alloc_len is supposed to be?  I guess you set it to the number of bytes
the @ioend covers?  And this function either returns NULL and alloc_len
untouched; or it returns a new ioend and the number of bytes remaining
in the passed-in ioend?

(or, as bfoster said, please improve the comments)

> +		unsigned int *alloc_len)
> +{
> +	struct bio *bio = &ioend->io_bio;
> +	struct iomap_ioend *split_ioend;
> +	struct bio *split;
> +	int sector_offset;
> +	unsigned int nr_segs;
> +
> +	if (is_append) {
> +		struct queue_limits *lim = bdev_limits(bio->bi_bdev);
> +
> +		sector_offset = bio_split_rw_at(bio, lim, &nr_segs,
> +			min(lim->max_zone_append_sectors << SECTOR_SHIFT,
> +			    *alloc_len));
> +		if (!sector_offset)
> +			return NULL;
> +	} else {
> +		if (bio->bi_iter.bi_size <= *alloc_len)
> +			return NULL;
> +		sector_offset = *alloc_len >> SECTOR_SHIFT;
> +	}
> +
> +	/* ensure the split ioend is still block size aligned */
> +	sector_offset = ALIGN_DOWN(sector_offset << SECTOR_SHIFT,
> +			i_blocksize(ioend->io_inode)) >> SECTOR_SHIFT;
> +
> +	split = bio_split(bio, sector_offset, GFP_NOFS, &iomap_ioend_bioset);
> +	if (!split)
> +		return NULL;
> +	split->bi_private = bio->bi_private;
> +	split->bi_end_io = bio->bi_end_io;
> +
> +	split_ioend = iomap_init_ioend(ioend->io_inode, split, ioend->io_offset,
> +			ioend->io_flags);
> +	split_ioend->io_parent = ioend;
> +
> +	atomic_inc(&ioend->io_remaining);
> +	ioend->io_offset += split_ioend->io_size;
> +	ioend->io_size -= split_ioend->io_size;
> +
> +	split_ioend->io_sector = ioend->io_sector;
> +	if (!is_append)
> +		ioend->io_sector += (split_ioend->io_size >> SECTOR_SHIFT);
> +
> +	*alloc_len -= split->bi_iter.bi_size;
> +	return split_ioend;
> +}
> +EXPORT_SYMBOL_GPL(iomap_split_ioend);
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 173d490c20ba..eaa8cb9083eb 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -354,6 +354,9 @@ struct iomap_ioend {
>  	struct list_head	io_list;	/* next ioend in chain */
>  	u16			io_flags;	/* IOMAP_IOEND_* */
>  	struct inode		*io_inode;	/* file being written to */
> +	atomic_t		io_remaining;	/* completetion defer count */
> +	int			io_error;	/* stashed away status */
> +	struct iomap_ioend	*io_parent;	/* parent for completions */

I guess this means ioends can chain together, sort of like how bios can
when you split them?

--D

>  	size_t			io_size;	/* size of the extent */
>  	loff_t			io_offset;	/* offset in the file */
>  	sector_t		io_sector;	/* start sector of ioend */
> @@ -404,6 +407,10 @@ struct iomap_writepage_ctx {
>  	u32			nr_folios;	/* folios added to the ioend */
>  };
>  
> +struct iomap_ioend *iomap_init_ioend(struct inode *inode, struct bio *bio,
> +		loff_t file_offset, u16 flags);
> +struct iomap_ioend *iomap_split_ioend(struct iomap_ioend *ioend, bool is_append,
> +		unsigned int *alloc_len);
>  void iomap_finish_ioends(struct iomap_ioend *ioend, int error);
>  void iomap_ioend_try_merge(struct iomap_ioend *ioend,
>  		struct list_head *more_ioends);
> @@ -475,4 +482,6 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
>  # define iomap_swapfile_activate(sis, swapfile, pagespan, ops)	(-EIO)
>  #endif /* CONFIG_SWAP */
>  
> +extern struct bio_set iomap_ioend_bioset;
> +
>  #endif /* LINUX_IOMAP_H */
> -- 
> 2.45.2
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/8] iomap: optionally use ioends for direct I/O
  2024-12-11  8:53 ` [PATCH 5/8] iomap: optionally use ioends for direct I/O Christoph Hellwig
  2024-12-12 13:29   ` Brian Foster
@ 2024-12-12 19:56   ` Darrick J. Wong
  2024-12-13  4:51     ` Christoph Hellwig
  1 sibling, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2024-12-12 19:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Carlos Maiolino, linux-xfs, linux-fsdevel

On Wed, Dec 11, 2024 at 09:53:45AM +0100, Christoph Hellwig wrote:
> struct iomap_ioend currently tracks outstanding buffered writes and has
> some really nice code in core iomap and XFS to merge contiguous I/Os
> an defer them to userspace for completion in a very efficient way.
> 
> For zoned writes we'll also need a per-bio user context completion to
> record the written blocks, and the infrastructure for that would look
> basically like the ioend handling for buffered I/O.
> 
> So instead of reinventing the wheel, reuse the existing infrastructure.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap/buffered-io.c |  3 +++
>  fs/iomap/direct-io.c   | 50 +++++++++++++++++++++++++++++++++++++++++-
>  fs/iomap/internal.h    |  7 ++++++
>  include/linux/iomap.h  |  4 +++-
>  4 files changed, 62 insertions(+), 2 deletions(-)
>  create mode 100644 fs/iomap/internal.h
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 8125f758a99d..ceca9473a09c 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -17,6 +17,7 @@
>  #include <linux/bio.h>
>  #include <linux/sched/signal.h>
>  #include <linux/migrate.h>
> +#include "internal.h"
>  #include "trace.h"
>  
>  #include "../internal.h"
> @@ -1582,6 +1583,8 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
>  
>  	if (!atomic_dec_and_test(&ioend->io_remaining))
>  		return 0;
> +	if (ioend->io_flags & IOMAP_IOEND_DIRECT)
> +		return iomap_finish_ioend_direct(ioend);
>  	return iomap_finish_ioend_buffered(ioend);
>  }

I'm a little surprised that more of the iomap_ioend* functions didn't
end up in ioend.c.

> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index b521eb15759e..b5466361cafe 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -12,6 +12,7 @@
>  #include <linux/backing-dev.h>
>  #include <linux/uio.h>
>  #include <linux/task_io_accounting_ops.h>
> +#include "internal.h"
>  #include "trace.h"
>  
>  #include "../internal.h"
> @@ -20,6 +21,7 @@
>   * Private flags for iomap_dio, must not overlap with the public ones in
>   * iomap.h:
>   */
> +#define IOMAP_DIO_NO_INVALIDATE	(1U << 25)
>  #define IOMAP_DIO_CALLER_COMP	(1U << 26)
>  #define IOMAP_DIO_INLINE_COMP	(1U << 27)
>  #define IOMAP_DIO_WRITE_THROUGH	(1U << 28)
> @@ -117,7 +119,8 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  	 * ->end_io() when necessary, otherwise a racing buffer read would cache
>  	 * zeros from unwritten extents.
>  	 */
> -	if (!dio->error && dio->size && (dio->flags & IOMAP_DIO_WRITE))
> +	if (!dio->error && dio->size && (dio->flags & IOMAP_DIO_WRITE) &&
> +	    !(dio->flags & IOMAP_DIO_NO_INVALIDATE))
>  		kiocb_invalidate_post_direct_write(iocb, dio->size);
>  
>  	inode_dio_end(file_inode(iocb->ki_filp));
> @@ -163,6 +166,51 @@ static inline void iomap_dio_set_error(struct iomap_dio *dio, int ret)
>  	cmpxchg(&dio->error, 0, ret);
>  }
>  
> +u32 iomap_finish_ioend_direct(struct iomap_ioend *ioend)
> +{
> +	struct iomap_dio *dio = ioend->io_bio.bi_private;
> +	bool should_dirty = (dio->flags & IOMAP_DIO_DIRTY);
> +	struct kiocb *iocb = dio->iocb;
> +	u32 vec_count = ioend->io_bio.bi_vcnt;
> +
> +	if (ioend->io_error)
> +		iomap_dio_set_error(dio, ioend->io_error);
> +
> +	if (atomic_dec_and_test(&dio->ref)) {
> +		struct inode *inode = file_inode(iocb->ki_filp);
> +
> +		if (dio->wait_for_completion) {
> +			struct task_struct *waiter = dio->submit.waiter;
> +
> +			WRITE_ONCE(dio->submit.waiter, NULL);
> +			blk_wake_io_task(waiter);
> +		} else if (!inode->i_mapping->nrpages) {
> +			WRITE_ONCE(iocb->private, NULL);
> +
> +			/*
> +			 * We must never invalidate pages from this thread to
> +			 * avoid deadlocks with buffered I/O completions.
> +			 * Tough luck if you hit the tiny race with someone
> +			 * dirtying the range now.

What happens, exactly?  Does that mean that the dirty pagecache always
survives?

--D

> +			 */
> +			dio->flags |= IOMAP_DIO_NO_INVALIDATE;
> +			iomap_dio_complete_work(&dio->aio.work);
> +		} else {
> +			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> +			queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> +		}
> +	}
> +
> +	if (should_dirty) {
> +		bio_check_pages_dirty(&ioend->io_bio);
> +	} else {
> +		bio_release_pages(&ioend->io_bio, false);
> +		bio_put(&ioend->io_bio);
> +	}
> +
> +	return vec_count;
> +}
> +
>  void iomap_dio_bio_end_io(struct bio *bio)
>  {
>  	struct iomap_dio *dio = bio->bi_private;
> diff --git a/fs/iomap/internal.h b/fs/iomap/internal.h
> new file mode 100644
> index 000000000000..20cccfc3bb13
> --- /dev/null
> +++ b/fs/iomap/internal.h
> @@ -0,0 +1,7 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _IOMAP_INTERNAL_H
> +#define _IOMAP_INTERNAL_H 1
> +
> +u32 iomap_finish_ioend_direct(struct iomap_ioend *ioend);
> +
> +#endif /* _IOMAP_INTERNAL_H */
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index eaa8cb9083eb..f6943c80e5fd 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -343,9 +343,11 @@ sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
>  #define IOMAP_IOEND_UNWRITTEN		(1U << 1)
>  /* don't merge into previous ioend */
>  #define IOMAP_IOEND_BOUNDARY		(1U << 2)
> +/* is direct I/O */
> +#define IOMAP_IOEND_DIRECT		(1U << 3)
>  
>  #define IOMAP_IOEND_NOMERGE_FLAGS \
> -	(IOMAP_IOEND_SHARED | IOMAP_IOEND_UNWRITTEN)
> +	(IOMAP_IOEND_SHARED | IOMAP_IOEND_UNWRITTEN | IOMAP_IOEND_DIRECT)
>  
>  /*
>   * Structure for writeback I/O completions.
> -- 
> 2.45.2
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 8/8] iomap: pass private data to iomap_truncate_page
  2024-12-11  8:53 ` [PATCH 8/8] iomap: pass private data to iomap_truncate_page Christoph Hellwig
@ 2024-12-12 19:56   ` Darrick J. Wong
  0 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2024-12-12 19:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Carlos Maiolino, linux-xfs, linux-fsdevel

On Wed, Dec 11, 2024 at 09:53:48AM +0100, Christoph Hellwig wrote:
> Allow the file system to pass private data which can be used by the
> iomap_begin and iomap_end methods through the private pointer in the
> iomap_iter structure.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Patches 6-8 look obviously correct to me, so for those three:
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> ---
>  fs/iomap/buffered-io.c | 4 ++--
>  fs/xfs/xfs_iomap.c     | 2 +-
>  include/linux/iomap.h  | 2 +-
>  3 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 6bfee1c7aedb..ccb2c6cbb18e 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1464,7 +1464,7 @@ EXPORT_SYMBOL_GPL(iomap_zero_range);
>  
>  int
>  iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> -		const struct iomap_ops *ops)
> +		const struct iomap_ops *ops, void *private)
>  {
>  	unsigned int blocksize = i_blocksize(inode);
>  	unsigned int off = pos & (blocksize - 1);
> @@ -1473,7 +1473,7 @@ iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
>  	if (!off)
>  		return 0;
>  	return iomap_zero_range(inode, pos, blocksize - off, did_zero, ops,
> -			NULL);
> +			private);
>  }
>  EXPORT_SYMBOL_GPL(iomap_truncate_page);
>  
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 3410c55f544a..5dd0922fe2d1 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1512,5 +1512,5 @@ xfs_truncate_page(
>  		return dax_truncate_page(inode, pos, did_zero,
>  					&xfs_dax_write_iomap_ops);
>  	return iomap_truncate_page(inode, pos, did_zero,
> -				   &xfs_buffered_write_iomap_ops);
> +				   &xfs_buffered_write_iomap_ops, NULL);
>  }
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 2a88dfa6ec55..19a2554622e6 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -315,7 +315,7 @@ int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
>  int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
>  		bool *did_zero, const struct iomap_ops *ops, void *private);
>  int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> -		const struct iomap_ops *ops);
> +		const struct iomap_ops *ops, void *private);
>  vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops,
>  		void *private);
>  typedef void (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length,
> -- 
> 2.45.2
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/8] iomap: split bios to zone append limits in the submission handlers
  2024-12-12 19:51   ` Darrick J. Wong
@ 2024-12-13  4:50     ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-12-13  4:50 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Christian Brauner, Carlos Maiolino, linux-xfs,
	linux-fsdevel

On Thu, Dec 12, 2024 at 11:51:49AM -0800, Darrick J. Wong wrote:
> > +struct iomap_ioend *iomap_split_ioend(struct iomap_ioend *ioend, bool is_append,
> 
> Can you determine is_append from (ioend->io_flags & ZONE_APPEND)?

That would require us to add that flag first :)  As we don't really
need that as persistent per-iomap that it's probably not worth it.

> Also it's not clear to me what the initial and output state of
> *alloc_len is supposed to be?  I guess you set it to the number of bytes
> the @ioend covers?

It gets set to the number of blocks that the allocator could find,
and iomap_split_ioend decrements the amount of that it used for the
ioend returned, which is min(*alloc_len, max_zone_append_sectors) for
sequential zones, or *alloc_len for conventional zones.

> > +++ b/include/linux/iomap.h
> > @@ -354,6 +354,9 @@ struct iomap_ioend {
> >  	struct list_head	io_list;	/* next ioend in chain */
> >  	u16			io_flags;	/* IOMAP_IOEND_* */
> >  	struct inode		*io_inode;	/* file being written to */
> > +	atomic_t		io_remaining;	/* completetion defer count */
> > +	int			io_error;	/* stashed away status */
> > +	struct iomap_ioend	*io_parent;	/* parent for completions */
> 
> I guess this means ioends can chain together, sort of like how bios can
> when you split them?

Exactly.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/8] iomap: optionally use ioends for direct I/O
  2024-12-12 19:56   ` Darrick J. Wong
@ 2024-12-13  4:51     ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-12-13  4:51 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Christian Brauner, Carlos Maiolino, linux-xfs,
	linux-fsdevel

On Thu, Dec 12, 2024 at 11:56:24AM -0800, Darrick J. Wong wrote:
> >  		return 0;
> > +	if (ioend->io_flags & IOMAP_IOEND_DIRECT)
> > +		return iomap_finish_ioend_direct(ioend);
> >  	return iomap_finish_ioend_buffered(ioend);
> >  }
> 
> I'm a little surprised that more of the iomap_ioend* functions didn't
> end up in ioend.c.

See the cover letter.  For development I wanted to avoid churn.  Once
we have general approval for the concept I'd like to move more code.

> > +			WRITE_ONCE(dio->submit.waiter, NULL);
> > +			blk_wake_io_task(waiter);
> > +		} else if (!inode->i_mapping->nrpages) {
> > +			WRITE_ONCE(iocb->private, NULL);
> > +
> > +			/*
> > +			 * We must never invalidate pages from this thread to
> > +			 * avoid deadlocks with buffered I/O completions.
> > +			 * Tough luck if you hit the tiny race with someone
> > +			 * dirtying the range now.
> 
> What happens, exactly?  Does that mean that the dirty pagecache always
> survives?

Yes.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/8] iomap: simplify io_flags and io_type in struct iomap_ioend
  2024-12-12 17:55   ` Darrick J. Wong
@ 2024-12-13  4:53     ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-12-13  4:53 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Christian Brauner, Carlos Maiolino, linux-xfs,
	linux-fsdevel

On Thu, Dec 12, 2024 at 09:55:04AM -0800, Darrick J. Wong wrote:
> > +#define IOMAP_IOEND_NOMERGE_FLAGS \
> > +	(IOMAP_IOEND_SHARED | IOMAP_IOEND_UNWRITTEN)
> 
> Hmm.  At first I wondered "Why wouldn't BOUNDARY be in here too?  It
> also prevents merging of ioends."  Then I remembered that BOUNDARY is an
> explicit nomerge flag, whereas what NOMERGE_FLAGS provides is that we
> always split ioends whenever the ioend work changes.
> 
> How about a comment?
> 
> /* split ioends when the type of completion work changes */
> #define IOMAP_IOEND_NOMERGE_FLAGS \
> 	(IOMAP_IOEND_SHARED | IOMAP_IOEND_UNWRITTEN)
> 
> Otherwise this looks fine to me.

The interesting thing about BOUNDARY is not just that it's explicit, but
also that it's one-way.  We can merge a non-BOUNDARY flag into the end
of a BOUNDARY one, just not a BOUNDARY one into the end of a non-BOUNDARY
one.

> 
> --D
> 
> > +
> >  /*
> >   * Structure for writeback I/O completions.
> >   */
> >  struct iomap_ioend {
> >  	struct list_head	io_list;	/* next ioend in chain */
> > -	u16			io_type;
> > -	u16			io_flags;	/* IOMAP_F_* */
> > +	u16			io_flags;	/* IOMAP_IOEND_* */
> >  	struct inode		*io_inode;	/* file being written to */
> >  	size_t			io_size;	/* size of the extent */
> >  	loff_t			io_offset;	/* offset in the file */
> > -- 
> > 2.45.2
> > 
> > 
---end quoted text---

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/8] iomap: add a IOMAP_F_ZONE_APPEND flag
  2024-12-12 18:05   ` Darrick J. Wong
@ 2024-12-13  4:55     ` Christoph Hellwig
  2024-12-16  4:55     ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-12-13  4:55 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Christian Brauner, Carlos Maiolino, linux-xfs,
	linux-fsdevel

On Thu, Dec 12, 2024 at 10:05:47AM -0800, Darrick J. Wong wrote:
> On Wed, Dec 11, 2024 at 09:53:43AM +0100, Christoph Hellwig wrote:
> > This doesn't much - just always returns the start block number for each
> 
> "This doesn't much" - I don't understand the sentence very well.  How
> about:
> 
> "Add a new IOMAP_F_ZONE_APPEND flag for the filesystem to indicate that
> the storage device must inform us where it wrote the data, so that the
> filesystem can update its own internal mapping metadata.  The filesystem
> should set the starting address of the zone in iomap::addr, and extract
> the LBA address from the bio during ioend processing.  iomap builds
> bios unconstrained by the hardware limits and will split them in the bio
> submission handler."

Sounds reasonable.

> The splitting happens whenever IOMAP_F_BOUNDARY gets set by
> ->map_blocks, right?

The splitting happens when:

 (a) we run out of enough space in one zone and have to switch to another
     one.  That then most likely picks an new zone, in which case
     IOMAP_F_BOUNDARY will be set
 (b) if the hardware I/O constraints require splitting the I/O (typically
     because it's too larger)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/8] iomap: add a IOMAP_F_ZONE_APPEND flag
  2024-12-12 18:05   ` Darrick J. Wong
  2024-12-13  4:55     ` Christoph Hellwig
@ 2024-12-16  4:55     ` Christoph Hellwig
  2024-12-19 17:30       ` Darrick J. Wong
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2024-12-16  4:55 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Christian Brauner, Carlos Maiolino, linux-xfs,
	linux-fsdevel

On Thu, Dec 12, 2024 at 10:05:47AM -0800, Darrick J. Wong wrote:
> >  #endif /* CONFIG_BUFFER_HEAD */
> >  #define IOMAP_F_XATTR		(1U << 5)
> >  #define IOMAP_F_BOUNDARY	(1U << 6)
> > +#define IOMAP_F_ZONE_APPEND	(1U << 7)
> 
> Needs a corresponding update in Documentation/iomap/ before we merge
> this series.

Btw, while doing these updates I start to really despise
Documentation/filesystems/iomap/.  It basically just duplicates what's
alredy in the code comments, just far away from the code to which it
belongs in that awkward RST format.  I'm not sure what it is supposed
to buy us?


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/8] iomap: add a IOMAP_F_ZONE_APPEND flag
  2024-12-16  4:55     ` Christoph Hellwig
@ 2024-12-19 17:30       ` Darrick J. Wong
  2024-12-19 17:35         ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2024-12-19 17:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Carlos Maiolino, linux-xfs, linux-fsdevel

On Mon, Dec 16, 2024 at 05:55:54AM +0100, Christoph Hellwig wrote:
> On Thu, Dec 12, 2024 at 10:05:47AM -0800, Darrick J. Wong wrote:
> > >  #endif /* CONFIG_BUFFER_HEAD */
> > >  #define IOMAP_F_XATTR		(1U << 5)
> > >  #define IOMAP_F_BOUNDARY	(1U << 6)
> > > +#define IOMAP_F_ZONE_APPEND	(1U << 7)
> > 
> > Needs a corresponding update in Documentation/iomap/ before we merge
> > this series.
> 
> Btw, while doing these updates I start to really despise
> Documentation/filesystems/iomap/.  It basically just duplicates what's
> alredy in the code comments, just far away from the code to which it
> belongs in that awkward RST format.  I'm not sure what it is supposed
> to buy us?

The iomap documentation has been helpful for the people working on
porting existing filesystems to iomap, because it provides a story for
"if you want to implement file operation X, this is how you'd do that in
iomap".  Seeing as we're coming up on -rc4 now, do you want to just send
your iomap patches and I'll go deal with the docs?

That way we have two people who understand what the two new flags mean.
:)

--D

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/8] iomap: add a IOMAP_F_ZONE_APPEND flag
  2024-12-19 17:30       ` Darrick J. Wong
@ 2024-12-19 17:35         ` Christoph Hellwig
  2024-12-19 17:36           ` Darrick J. Wong
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2024-12-19 17:35 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Christian Brauner, Carlos Maiolino, linux-xfs,
	linux-fsdevel

On Thu, Dec 19, 2024 at 09:30:30AM -0800, Darrick J. Wong wrote:
> The iomap documentation has been helpful for the people working on
> porting existing filesystems to iomap, because it provides a story for
> "if you want to implement file operation X, this is how you'd do that in
> iomap".  Seeing as we're coming up on -rc4 now, do you want to just send
> your iomap patches and I'll go deal with the docs?

I've actually writtem (or rahter copy and pasted) the documentation
already, that's what made me notice how duplicative it is.  But let
me shoot out what I currently have.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/8] iomap: add a IOMAP_F_ZONE_APPEND flag
  2024-12-19 17:35         ` Christoph Hellwig
@ 2024-12-19 17:36           ` Darrick J. Wong
  0 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2024-12-19 17:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Carlos Maiolino, linux-xfs, linux-fsdevel

On Thu, Dec 19, 2024 at 06:35:12PM +0100, Christoph Hellwig wrote:
> On Thu, Dec 19, 2024 at 09:30:30AM -0800, Darrick J. Wong wrote:
> > The iomap documentation has been helpful for the people working on
> > porting existing filesystems to iomap, because it provides a story for
> > "if you want to implement file operation X, this is how you'd do that in
> > iomap".  Seeing as we're coming up on -rc4 now, do you want to just send
> > your iomap patches and I'll go deal with the docs?
> 
> I've actually writtem (or rahter copy and pasted) the documentation
> already, that's what made me notice how duplicative it is.  But let
> me shoot out what I currently have.

Oh, ok.  I was planning to repost the rtrmap/reflink series shortly FYI.

--D

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2024-12-19 17:36 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-11  8:53 RFC: iomap patches for zoned XFS Christoph Hellwig
2024-12-11  8:53 ` [PATCH 1/8] iomap: allow the file system to submit the writeback bios Christoph Hellwig
2024-12-12 17:48   ` Darrick J. Wong
2024-12-11  8:53 ` [PATCH 2/8] iomap: simplify io_flags and io_type in struct iomap_ioend Christoph Hellwig
2024-12-12 17:55   ` Darrick J. Wong
2024-12-13  4:53     ` Christoph Hellwig
2024-12-11  8:53 ` [PATCH 3/8] iomap: add a IOMAP_F_ZONE_APPEND flag Christoph Hellwig
2024-12-12 18:05   ` Darrick J. Wong
2024-12-13  4:55     ` Christoph Hellwig
2024-12-16  4:55     ` Christoph Hellwig
2024-12-19 17:30       ` Darrick J. Wong
2024-12-19 17:35         ` Christoph Hellwig
2024-12-19 17:36           ` Darrick J. Wong
2024-12-11  8:53 ` [PATCH 4/8] iomap: split bios to zone append limits in the submission handlers Christoph Hellwig
2024-12-12 13:28   ` Brian Foster
2024-12-12 15:05     ` Christoph Hellwig
2024-12-12 14:21   ` John Garry
2024-12-12 15:07     ` Christoph Hellwig
2024-12-12 19:51   ` Darrick J. Wong
2024-12-13  4:50     ` Christoph Hellwig
2024-12-11  8:53 ` [PATCH 5/8] iomap: optionally use ioends for direct I/O Christoph Hellwig
2024-12-12 13:29   ` Brian Foster
2024-12-12 15:12     ` Christoph Hellwig
2024-12-12 19:56   ` Darrick J. Wong
2024-12-13  4:51     ` Christoph Hellwig
2024-12-11  8:53 ` [PATCH 6/8] iomap: pass private data to iomap_page_mkwrite Christoph Hellwig
2024-12-11  8:53 ` [PATCH 7/8] iomap: pass private data to iomap_zero_range Christoph Hellwig
2024-12-11  8:53 ` [PATCH 8/8] iomap: pass private data to iomap_truncate_page Christoph Hellwig
2024-12-12 19:56   ` Darrick J. Wong
2024-12-12 13:29 ` RFC: iomap patches for zoned XFS Brian Foster

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox