* [RFC 0/3] Add buffered write-through support to iomap & xfs
@ 2026-03-09 17:34 Ojaswin Mujoo
2026-03-09 17:34 ` [RFC 1/3] iomap: Support buffered RWF_WRITETHROUGH via async dio backend Ojaswin Mujoo
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Ojaswin Mujoo @ 2026-03-09 17:34 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: djwong, john.g.garry, willy, hch, ritesh.list, jack,
Luis Chamberlain, dgc, tytso, p.raghav, andres, linux-kernel
Hi all,
This patchset implements an early design prototype of buffered I/O
write-through semantics in Linux.
This idea mainly picked up traction as a way to enable RWF_ATOMIC buffered
IO [1]; however, the write-through path has many use cases beyond atomic writes:
- enabling truly async AIO buffered I/O when issued with O_DSYNC
- better scalability for buffered I/O
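For reference, a userspace caller would issue such a write with pwritev2().
A minimal sketch, assuming the RWF_WRITETHROUGH value proposed in the uapi
change of patch 1 (hypothetical until merged; kernels without this series
fail the call with EOPNOTSUPP, which the helper below falls back from):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_WRITETHROUGH
#define RWF_WRITETHROUGH 0x00000200 /* proposed in patch 1; hypothetical until merged */
#endif

/*
 * Buffered write that is immediately submitted for writeback via the
 * write-through path. Falls back to a plain buffered write if the
 * running kernel does not support the flag.
 */
static ssize_t writethrough_write(int fd, const void *buf, size_t len,
				  off_t pos)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t ret = pwritev2(fd, &iov, 1, pos, RWF_WRITETHROUGH);

	if (ret < 0 && errno == EOPNOTSUPP)
		ret = pwritev2(fd, &iov, 1, pos, 0);
	return ret;
}
```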
The implementation of write-through combines the buffered IO frontend
with the async dio backend, which leads to some interesting interactions.
I've added most of the design notes in the respective patches. Please
note that this is an initial RFC meant to iron out any early design
issues. It is largely based on suggestions from Dave and Jan in [1], so
thanks for the pointers!
* Testing Notes *
- I've added support for RWF_WRITETHROUGH to fsx and fsstress in
xfstests and these patches survive fsx with integrity verification as
well as fsstress parallel stressing.
- -g quick with block size == page size and block size < page size shows
no new regressions.
* Design TODOs *
- Evaluate whether we need to tag the page cache dirty bit in the xarray,
since PG_writeback is already set on the folio.
- As mentioned in the design notes of patch 1, we call ->iomap_begin()
twice, which is not ideal but is kept this way to avoid churn and keep
the PoC minimal. Look into a better way to refactor this.
- Fix interaction with filesystem freezing.
* Future work (once design is finalized) *
- Add aio O_DSYNC buffered write-through support
- Add RWF_ATOMIC support for buffered IO via write-through path
- Add support for other RWF_ flags in the write-through buffered I/O path
- Benchmarking numbers and more thorough testing needed.
As usual, thoughts and suggestions are welcome.
[1] https://lore.kernel.org/all/d0c4d95b-8064-4a7e-996d-7ad40eb4976b@linux.dev/
Regards,
ojaswin
Ojaswin Mujoo (3):
iomap: Support buffered RWF_WRITETHROUGH via async dio backend
iomap: Enable stable writes for RWF_WRITETHROUGH inodes
xfs: Add RWF_WRITETHROUGH support to xfs
fs/inode.c | 1 +
fs/iomap/buffered-io.c | 414 ++++++++++++++++++++++++++++++++++++++++
fs/iomap/direct-io.c | 64 ++++---
fs/xfs/xfs_file.c | 68 ++++++-
include/linux/fs.h | 9 +
include/linux/iomap.h | 34 ++++
include/uapi/linux/fs.h | 5 +-
7 files changed, 568 insertions(+), 27 deletions(-)
--
2.52.0
^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC 1/3] iomap: Support buffered RWF_WRITETHROUGH via async dio backend
2026-03-09 17:34 [RFC 0/3] Add buffered write-through support to iomap & xfs Ojaswin Mujoo
@ 2026-03-09 17:34 ` Ojaswin Mujoo
2026-03-10 6:48 ` Dave Chinner
2026-03-09 17:34 ` [RFC 2/3] iomap: Enable stable writes for RWF_WRITETHROUGH inodes Ojaswin Mujoo
2026-03-09 17:34 ` [RFC 3/3] xfs: Add RWF_WRITETHROUGH support to xfs Ojaswin Mujoo
2 siblings, 1 reply; 11+ messages in thread
From: Ojaswin Mujoo @ 2026-03-09 17:34 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: djwong, john.g.garry, willy, hch, ritesh.list, jack,
Luis Chamberlain, dgc, tytso, p.raghav, andres, linux-kernel
This adds initial support for performing buffered RWF_WRITETHROUGH writes.
The rough flow of a writethrough write is as follows:
1. Acquire inode lock and call iomap begin to get an allocated mapping.
2. Acquire folio lock.
3. Perform a memcpy from user buffer to the folio and mark it dirty
4. Wait for any current writeback to complete and then call folio_mkclean()
to prevent mmap writes from changing it.
5. Start writeback on the folio
6. Use dio codepath to send an asynchronous dio. We use the
inode_dio_begin/end() logic for writethrough as well to serialize
against paths like truncate.
7. Once the IO is queued, the write syscall is free to unlock folio and
return.
8. In the endio path, cleanup resources, record any errors and clear
writeback on folio.
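Step 6 reuses the dio machinery, which only accepts block-aligned positions
and lengths, so the written range is first rounded out to filesystem block
boundaries (as iomap_writethrough_begin() does in this patch). A userspace
sketch of that rounding, assuming a power-of-two block size (values
illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Power-of-two rounding, same semantics as the kernel's round_down()/round_up() */
static uint64_t rdown(uint64_t x, uint64_t bs) { return x & ~(bs - 1); }
static uint64_t rup(uint64_t x, uint64_t bs)   { return rdown(x + bs - 1, bs); }

/*
 * Align an (offset-in-folio, length) pair for submission through the dio
 * path, mirroring what iomap_writethrough_begin() does before building
 * the bio_vec.
 */
static void align_for_dio(uint64_t off, uint64_t len, uint64_t bs,
			  uint64_t *off_aligned, uint64_t *len_aligned)
{
	*off_aligned = rdown(off, bs);
	*len_aligned = rup(len, bs);
}
```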
A few things to note about the design:
1. Folio handling note: We might be writing through a partial folio, so
we need to be careful not to clear the folio dirty bit unless there are
no dirty blocks left in the folio after the writethrough.
2. We call iomap_begin() twice: once at the start and again within
iomap_dio_rw(). Functionally this should be okay since the second
call should just return whatever the first call allocated. This
redundancy is a tradeoff to avoid churning too much code for
the initial PoC.
3. Along with the writeback bit, we also use inode_dio_begin() to
synchronize against paths like truncate. This might be too restrictive,
but we can revisit it in future revisions.
4. Freezing support is a WIP. Check the comment on top of
iomap_writethrough_iter()
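Note 1 above effectively requires per-block dirty accounting: clear the
dirty bits only over the written range, and mark the whole folio clean only
when no dirty blocks remain (mirroring the iomap_clear_range_dirty() /
iomap_find_dirty_range() pair used in this patch). A userspace model with
one dirty bit per block (block count illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One dirty bit per block of a folio (supports up to 32 blocks here) */
struct folio_model {
	uint32_t dirty;		/* bit i set => block i is dirty */
};

/* Clear the dirty bits covering [first, first + nblocks) */
static void clear_range_dirty(struct folio_model *f, unsigned int first,
			      unsigned int nblocks)
{
	f->dirty &= ~((((uint32_t)1 << nblocks) - 1) << first);
}

/* The folio dirty bit may only be cleared when this returns false */
static bool folio_has_dirty_blocks(const struct folio_model *f)
{
	return f->dirty != 0;
}
```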
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Dave Chinner <dgc@kernel.org>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
fs/iomap/buffered-io.c | 383 ++++++++++++++++++++++++++++++++++++++++
fs/iomap/direct-io.c | 62 ++++---
include/linux/fs.h | 7 +
include/linux/iomap.h | 32 ++++
include/uapi/linux/fs.h | 5 +-
5 files changed, 466 insertions(+), 23 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 3cf93ab2e38a..ab169daa1126 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -9,6 +9,7 @@
#include <linux/swap.h>
#include <linux/migrate.h>
#include <linux/fserror.h>
+#include <linux/rmap.h>
#include "internal.h"
#include "trace.h"
@@ -1091,6 +1092,351 @@ static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
return __iomap_write_end(iter->inode, pos, len, copied, folio);
}
+static void iomap_writethrough_endio(struct kiocb *iocb, long ret)
+{
+ struct iomap_writethrough_ctx *wt_ctx =
+ container_of(iocb, struct iomap_writethrough_ctx, iocb);
+ struct inode *inode = wt_ctx->inode;
+
+ /*
+ * NOTE: Is ret always < 0 for short writes? ioend_writeback_end_io
+ * seems to suggest so.
+ */
+ if (ret < 0) {
+ mapping_set_error(inode->i_mapping, ret);
+ pr_err_ratelimited(
+ "%s: writeback error on inode %lu, offset %lld",
+ inode->i_sb->s_id, inode->i_ino, iocb->ki_pos);
+ }
+
+ fput(iocb->ki_filp);
+ folio_end_writeback(wt_ctx->folio);
+ kfree(wt_ctx->bvec);
+ kfree(wt_ctx);
+}
+
+/*
+ * Check that the pos and length of the writethrough satisfy the constraints.
+ * Returns false if checks fail, else true.
+ */
+static bool iomap_writethrough_checks(struct kiocb *iocb, size_t off, loff_t len,
+ struct folio *folio)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ int bs = i_blocksize(inode);
+ loff_t start = iocb->ki_pos;
+ loff_t end = start + len;
+ loff_t folio_end = folio_pos(folio) + folio_size(folio);
+
+ /*
+ * start and length should be aligned to block size.
+ */
+ if (WARN_ON((start | len) & (bs - 1)))
+ return false;
+
+ /*
+ * We modified start as well as offset in folio, so make sure they are
+ * still in sync
+ */
+ if (WARN_ON(off != offset_in_folio(folio, start)))
+ return false;
+
+ /*
+ * Range should be contained in folio
+ */
+ if (WARN_ON(start < folio_pos(folio) || end > folio_end))
+ return false;
+
+ return true;
+}
+
+/*
+ * With writethrough, we might potentially be writing through a partial
+ * folio hence we don't clear the dirty bit (yet)
+ */
+static void folio_prepare_writethrough(struct folio *folio)
+{
+ if (folio_test_writeback(folio))
+ folio_wait_writeback(folio);
+
+ /*
+ * TODO: We are trying to avoid folio_mkclean() usages but we need it
+ * here to serialize against mmap writes. Is there a better way?
+ */
+ if (folio_mkclean(folio))
+ /* Refer folio_clear_dirty_for_io() for why this is needed */
+ folio_mark_dirty(folio);
+
+}
+
+/**
+ * iomap_writethrough_begin - prepare the various structures for writethrough
+ * @iocb: kiocb describing the write
+ * @folio: folio being written through
+ * @iter: iomap iter holding mapping information
+ * @wt_ctx: holds context needed during IO and endio
+ * @iov_wt: (output) will hold the iov_iter that can be passed to dio
+ * @wt_off: (input/output) holds the offset of write. Upon return, will hold the
+ * aligned offset
+ * @wt_len: (input/output) holds the len of write. Upon return, will hold the
+ * aligned len
+ *
+ * This function does the major preparation work needed before starting the
+ * writethrough. The main task is to prepare the folio for writethrough (by
+ * setting writeback on it) and to ensure the offset and len are block aligned
+ * so that dio doesn't complain.
+ *
+ * In case an error is encountered, the folio writeback won't be started and the
+ * range under writethrough will still be dirty.
+ */
+static int iomap_writethrough_begin(struct kiocb *iocb, struct folio *folio,
+ struct iomap_iter *iter,
+ struct iomap_writethrough_ctx *wt_ctx,
+ struct iov_iter *iov_wt, size_t offset,
+ u64 len)
+{
+ int bs = i_blocksize(iter->inode);
+
+ size_t off_aligned = round_down(offset, bs);
+ u64 len_aligned = round_up(len, bs);
+ u64 pos_aligned = round_down(iter->pos, bs);
+ bool fully_written;
+ u64 zero = 0;
+
+ folio_prepare_writethrough(folio);
+
+ wt_ctx->bvec = kmalloc(sizeof(struct bio_vec), GFP_KERNEL | GFP_NOFS);
+ if (!wt_ctx->bvec)
+ return -ENOMEM;
+
+ bvec_set_folio(wt_ctx->bvec, folio, len_aligned, off_aligned);
+ iov_iter_bvec(iov_wt, ITER_SOURCE, wt_ctx->bvec, 1, len_aligned);
+
+ kiocb_clone(&wt_ctx->iocb, iocb, iocb->ki_filp);
+ wt_ctx->iocb.ki_pos = pos_aligned;
+ wt_ctx->iocb.ki_complete = iomap_writethrough_endio;
+ wt_ctx->folio = folio;
+ wt_ctx->inode = iter->inode;
+ wt_ctx->orig_pos = iter->pos;
+ wt_ctx->orig_len = len;
+
+ if (!iomap_writethrough_checks(
+ &wt_ctx->iocb, off_aligned,
+ iov_iter_count(iov_wt), folio)) {
+ /* This should never happen */
+ WARN_ON_ONCE(true);
+
+ kfree(wt_ctx->bvec);
+ return -EINVAL;
+ }
+
+ get_file(wt_ctx->iocb.ki_filp);
+
+ /*
+ * We might write through the complete folio, or a partial folio
+ * writethrough might leave no dirty blocks behind, so we need to
+ * check and mark the folio clean if that is the case.
+ */
+ fully_written = (off_aligned == 0 && len_aligned == folio_size(folio));
+
+ iomap_clear_range_dirty(folio, off_aligned, len_aligned);
+ if (fully_written ||
+ !iomap_find_dirty_range(folio, &zero, folio_size(folio)))
+ folio_clear_dirty(folio);
+
+ folio_start_writeback(folio);
+
+ return 0;
+}
+
+/**
+ * iomap_writethrough_iter - perform RWF_WRITETHROUGH buffered write
+ * @iocb: kernel iocb struct
+ * @iter: iomap iter holding mapping information
+ * @i: iov_iter for write
+ * @wt_ops: the fs callbacks needed for writethrough
+ *
+ * This function copies the user buffer to the folio, similar to the usual
+ * buffered IO path, with the difference that we immediately issue the IO. For
+ * this we utilize the async dio machinery. While issuing the async IO, we need
+ * to be careful to clone the iocb so that it doesn't get destroyed underneath
+ * us in case the syscall exits before endio() is triggered.
+ *
+ * Folio handling note: We might be writing through a partial folio so we need
+ * to be careful not to clear the folio dirty bit unless there are no dirty blocks
+ * in the folio after the writethrough.
+ *
+ * TODO: Filesystem freezing during ongoing writethrough writes is currently
+ * buggy. We call file_start_write() once before taking any lock but we can't
+ * just simply call the corresponding file_end_write() in endio because single
+ * RWF_WRITETHROUGH might be split into many IOs leading to multiple endio()
+ * calls. Currently we are looking into the right way to synchronise with
+ * freeze_super().
+ */
+static int iomap_writethrough_iter(struct kiocb *iocb, struct iomap_iter *iter,
+ struct iov_iter *i,
+ const struct iomap_writethrough_ops *wt_ops)
+{
+ ssize_t total_written = 0;
+ int status = 0;
+ struct address_space *mapping = iter->inode->i_mapping;
+ size_t chunk = mapping_max_folio_size(mapping);
+ unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0;
+
+ if (!(iter->flags & IOMAP_WRITETHROUGH))
+ return -EINVAL;
+
+ do {
+ struct folio *folio;
+ loff_t old_size;
+ size_t offset; /* Offset into folio */
+ u64 bytes; /* Bytes to write to folio */
+ size_t copied; /* Bytes copied from user */
+ u64 written; /* Bytes have been written */
+ loff_t pos;
+ bool noretry = false;
+
+ bytes = iov_iter_count(i);
+retry:
+ offset = iter->pos & (chunk - 1);
+ bytes = min(chunk - offset, bytes);
+ status = balance_dirty_pages_ratelimited_flags(mapping,
+ bdp_flags);
+ if (unlikely(status))
+ break;
+
+ if (bytes > iomap_length(iter))
+ bytes = iomap_length(iter);
+
+ /*
+ * Bring in the user page that we'll copy from _first_.
+ * Otherwise there's a nasty deadlock on copying from the
+ * same page as we're writing to, without it being marked
+ * up-to-date.
+ *
+ * For async buffered writes the assumption is that the user
+ * page has already been faulted in. This can be optimized by
+ * faulting the user page.
+ */
+ if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
+ status = -EFAULT;
+ break;
+ }
+
+ status = iomap_write_begin(iter, wt_ops->write_ops, &folio,
+ &offset, &bytes);
+ if (unlikely(status)) {
+ iomap_write_failed(iter->inode, iter->pos, bytes);
+ break;
+ }
+ if (iter->iomap.flags & IOMAP_F_STALE)
+ break;
+
+ pos = iter->pos;
+
+ if (mapping_writably_mapped(mapping))
+ flush_dcache_folio(folio);
+
+ copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
+ written = iomap_write_end(iter, bytes, copied, folio) ?
+ copied : 0;
+
+ /*
+ * Update the in-memory inode size after copying the data into
+ * the page cache. It's up to the file system to write the
+ * updated size to disk, preferably after I/O completion so that
+ * no stale data is exposed. Only once that's done can we
+ * unlock and release the folio.
+ */
+ old_size = iter->inode->i_size;
+ if (pos + written > old_size) {
+ i_size_write(iter->inode, pos + written);
+ iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
+ }
+
+ if (!written)
+ goto put_folio;
+
+ /*
+ * The copy-to-folio operation succeeded. Lets use the dio
+ * machinery to send the writethrough IO.
+ */
+ if (written) {
+ struct iomap_writethrough_ctx *wt_ctx;
+ int dio_flags = IOMAP_DIO_BUF_WRITETHROUGH;
+ struct iov_iter iov_wt;
+
+ wt_ctx = kzalloc(sizeof(struct iomap_writethrough_ctx),
+ GFP_KERNEL | GFP_NOFS);
+ if (!wt_ctx) {
+ status = -ENOMEM;
+ written = 0;
+ goto put_folio;
+ }
+
+ status = iomap_writethrough_begin(iocb, folio, iter,
+ wt_ctx, &iov_wt,
+ offset, written);
+ if (status < 0) {
+ if (status != -ENOMEM)
+ noretry = true;
+ written = 0;
+ kfree(wt_ctx);
+ goto put_folio;
+ }
+
+ /* Don't retry for any failures in writethrough */
+ noretry = true;
+
+ status = iomap_dio_rw(&wt_ctx->iocb, &iov_wt,
+ wt_ops->ops, wt_ops->dio_ops,
+ dio_flags, NULL, 0);
+
+ /*
+ * If IO is queued, then we will do all the cleanup
+ * during ioend so just unlock the folio.
+ */
+ if (status == -EIOCBQUEUED)
+ goto put_folio;
+
+ /*
+ * We either encountered an error or IO completed. In
+ * either case, it is now safe to free up resources and
+ * end writeback.
+ */
+ if (status < 0)
+ written = 0;
+
+ iomap_writethrough_endio(&wt_ctx->iocb, status);
+ }
+put_folio:
+ __iomap_put_folio(iter, wt_ops->write_ops, written, folio);
+
+ if (old_size < pos)
+ pagecache_isize_extended(iter->inode, old_size, pos);
+
+ cond_resched();
+ if (unlikely(written == 0)) {
+ iomap_write_failed(iter->inode, pos, bytes);
+ iov_iter_revert(i, copied);
+
+ if (noretry)
+ break;
+ if (chunk > PAGE_SIZE)
+ chunk /= 2;
+ if (copied) {
+ bytes = copied;
+ goto retry;
+ }
+ } else {
+ total_written += written;
+ iomap_iter_advance(iter, written);
+ }
+ } while (iov_iter_count(i) && iomap_length(iter));
+
+ return total_written ? 0 : status;
+}
+
static int iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i,
const struct iomap_write_ops *write_ops)
{
@@ -1227,6 +1573,43 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
}
EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
+ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
+ const struct iomap_writethrough_ops *wt_ops,
+ void *private)
+{
+ struct iomap_iter iter = {
+ .inode = iocb->ki_filp->f_mapping->host,
+ .pos = iocb->ki_pos,
+ .len = iov_iter_count(i),
+ .flags = IOMAP_WRITE,
+ .private = private,
+ };
+ ssize_t ret;
+
+ /*
+ * For now we don't support any other flag with WRITETHROUGH
+ */
+ if (!(iocb->ki_flags & IOCB_WRITETHROUGH))
+ return -EINVAL;
+ if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_DONTCACHE))
+ return -EINVAL;
+
+ iter.flags |= IOMAP_WRITETHROUGH;
+
+ while ((ret = iomap_iter(&iter, wt_ops->ops)) > 0) {
+ WARN_ON(iter.iomap.type != IOMAP_UNWRITTEN &&
+ iter.iomap.type != IOMAP_MAPPED);
+ iter.status = iomap_writethrough_iter(iocb, &iter, i, wt_ops);
+ }
+
+ if (unlikely(iter.pos == iocb->ki_pos))
+ return ret;
+ ret = iter.pos - iocb->ki_pos;
+ iocb->ki_pos = iter.pos;
+ return ret;
+}
+EXPORT_SYMBOL_GPL(iomap_file_writethrough_write);
+
static void iomap_write_delalloc_ifs_punch(struct inode *inode,
struct folio *folio, loff_t start_byte, loff_t end_byte,
struct iomap *iomap, iomap_punch_t punch)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index c24d94349ca5..f4d8ff08a83a 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -713,7 +713,8 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
dio->i_size = i_size_read(inode);
dio->dops = dops;
dio->error = 0;
- dio->flags = dio_flags & (IOMAP_DIO_FSBLOCK_ALIGNED | IOMAP_DIO_BOUNCE);
+ dio->flags = dio_flags & (IOMAP_DIO_FSBLOCK_ALIGNED | IOMAP_DIO_BOUNCE |
+ IOMAP_DIO_BUF_WRITETHROUGH);
dio->done_before = done_before;
dio->submit.iter = iter;
@@ -747,8 +748,13 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
if (iocb->ki_flags & IOCB_ATOMIC)
iomi.flags |= IOMAP_ATOMIC;
- /* for data sync or sync, we need sync completion processing */
- if (iocb_is_dsync(iocb)) {
+ /*
+ * for data sync or sync, we need sync completion processing.
+ * for buffered writethrough, sync is handled in buffered IO
+ * path so not needed here
+ */
+ if (iocb_is_dsync(iocb) &&
+ !(dio->flags & IOMAP_DIO_BUF_WRITETHROUGH)) {
dio->flags |= IOMAP_DIO_NEED_SYNC;
/*
@@ -765,35 +771,47 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
}
/*
- * i_size updates must to happen from process context.
+ * i_size updates must happen from process context. For
+ * buffered writethrough, the caller might have already changed the
+ * i_size but still needs endio i_size handling. We can't detect
+ * this here, so just use process context unconditionally.
*/
- if (iomi.pos + iomi.len > dio->i_size)
+ if ((iomi.pos + iomi.len > dio->i_size) ||
+ dio_flags & IOMAP_DIO_BUF_WRITETHROUGH)
dio->flags |= IOMAP_DIO_COMP_WORK;
/*
* Try to invalidate cache pages for the range we are writing.
* If this invalidation fails, let the caller fall back to
* buffered I/O.
+ *
+ * The exception is if we are using the dio path for buffered
+ * RWF_WRITETHROUGH, in which case we cannot invalidate the pages
+ * as we are writing them through and already hold their
+ * folio_lock. For the same reason, disable end-of-write invalidation.
*/
- ret = kiocb_invalidate_pages(iocb, iomi.len);
- if (ret) {
- if (ret != -EAGAIN) {
- trace_iomap_dio_invalidate_fail(inode, iomi.pos,
- iomi.len);
- if (iocb->ki_flags & IOCB_ATOMIC) {
- /*
- * folio invalidation failed, maybe
- * this is transient, unlock and see if
- * the caller tries again.
- */
- ret = -EAGAIN;
- } else {
- /* fall back to buffered write */
- ret = -ENOTBLK;
+ if (!(dio_flags & IOMAP_DIO_BUF_WRITETHROUGH)) {
+ ret = kiocb_invalidate_pages(iocb, iomi.len);
+ if (ret) {
+ if (ret != -EAGAIN) {
+ trace_iomap_dio_invalidate_fail(inode, iomi.pos,
+ iomi.len);
+ if (iocb->ki_flags & IOCB_ATOMIC) {
+ /*
+ * folio invalidation failed, maybe
+ * this is transient, unlock and see if
+ * the caller tries again.
+ */
+ ret = -EAGAIN;
+ } else {
+ /* fall back to buffered write */
+ ret = -ENOTBLK;
+ }
}
+ goto out_free_dio;
}
- goto out_free_dio;
- }
+ } else
+ dio->flags |= IOMAP_DIO_NO_INVALIDATE;
}
if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8b3dd145b25e..ca291957140e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -346,6 +346,7 @@ struct readahead_control;
#define IOCB_ATOMIC (__force int) RWF_ATOMIC
#define IOCB_DONTCACHE (__force int) RWF_DONTCACHE
#define IOCB_NOSIGNAL (__force int) RWF_NOSIGNAL
+#define IOCB_WRITETHROUGH (__force int) RWF_WRITETHROUGH
/* non-RWF related bits - start at 16 */
#define IOCB_EVENTFD (1 << 16)
@@ -1985,6 +1986,8 @@ struct file_operations {
#define FOP_ASYNC_LOCK ((__force fop_flags_t)(1 << 6))
/* File system supports uncached read/write buffered IO */
#define FOP_DONTCACHE ((__force fop_flags_t)(1 << 7))
+/* File system supports write through buffered IO */
+#define FOP_WRITETHROUGH ((__force fop_flags_t)(1 << 8))
/* Wrap a directory iterator that needs exclusive inode access */
int wrap_directory_iterator(struct file *, struct dir_context *,
@@ -3436,6 +3439,10 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
if (IS_DAX(ki->ki_filp->f_mapping->host))
return -EOPNOTSUPP;
}
+ if (flags & RWF_WRITETHROUGH)
+ /* file system must support it */
+ if (!(ki->ki_filp->f_op->fop_flags & FOP_WRITETHROUGH))
+ return -EOPNOTSUPP;
kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
if (flags & RWF_SYNC)
kiocb_flags |= IOCB_DSYNC;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 531f9ebdeeae..b96574bb2918 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -209,6 +209,7 @@ struct iomap_write_ops {
#endif /* CONFIG_FS_DAX */
#define IOMAP_ATOMIC (1 << 9) /* torn-write protection */
#define IOMAP_DONTCACHE (1 << 10)
+#define IOMAP_WRITETHROUGH (1 << 11)
struct iomap_ops {
/*
@@ -475,6 +476,15 @@ struct iomap_writepage_ctx {
void *wb_ctx; /* pending writeback context */
};
+struct iomap_writethrough_ctx {
+ struct kiocb iocb;
+ struct folio *folio;
+ struct inode *inode;
+ struct bio_vec *bvec;
+ loff_t orig_pos;
+ loff_t orig_len;
+};
+
struct iomap_ioend *iomap_init_ioend(struct inode *inode, struct bio *bio,
loff_t file_offset, u16 ioend_flags);
struct iomap_ioend *iomap_split_ioend(struct iomap_ioend *ioend,
@@ -590,6 +600,14 @@ struct iomap_dio_ops {
*/
#define IOMAP_DIO_BOUNCE (1 << 4)
+/*
+ * Set when we are using the dio path to perform writethrough for
+ * RWF_WRITETHROUGH buffered write. The ->endio handler must check this
+ * to perform any writethrough related cleanup like ending writeback on
+ * a folio.
+ */
+#define IOMAP_DIO_BUF_WRITETHROUGH (1 << 5)
+
ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
unsigned int dio_flags, void *private, size_t done_before);
@@ -599,6 +617,20 @@ struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
ssize_t iomap_dio_complete(struct iomap_dio *dio);
void iomap_dio_bio_end_io(struct bio *bio);
+/*
+ * In writethrough, we copy user data to folio first and then send the folio
+ * to writeback via dio path. To achieve this, we need callbacks from iomap_ops,
+ * iomap_write_ops and iomap_dio_ops. This struct packs them together.
+ */
+struct iomap_writethrough_ops {
+ const struct iomap_ops *ops;
+ const struct iomap_write_ops *write_ops;
+ const struct iomap_dio_ops *dio_ops;
+};
+ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
+ const struct iomap_writethrough_ops *wt_ops,
+ void *private);
+
#ifdef CONFIG_SWAP
struct file;
struct swap_info_struct;
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 70b2b661f42c..dec78041b0cf 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -435,10 +435,13 @@ typedef int __bitwise __kernel_rwf_t;
/* prevent pipe and socket writes from raising SIGPIPE */
#define RWF_NOSIGNAL ((__force __kernel_rwf_t)0x00000100)
+/* buffered IO that is asynchronously written through to disk after write */
+#define RWF_WRITETHROUGH ((__force __kernel_rwf_t)0x00000200)
+
/* mask of flags supported by the kernel */
#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
- RWF_DONTCACHE | RWF_NOSIGNAL)
+ RWF_DONTCACHE | RWF_NOSIGNAL | RWF_WRITETHROUGH)
#define PROCFS_IOCTL_MAGIC 'f'
--
2.52.0
* [RFC 2/3] iomap: Enable stable writes for RWF_WRITETHROUGH inodes
2026-03-09 17:34 [RFC 0/3] Add buffered write-through support to iomap & xfs Ojaswin Mujoo
2026-03-09 17:34 ` [RFC 1/3] iomap: Support buffered RWF_WRITETHROUGH via async dio backend Ojaswin Mujoo
@ 2026-03-09 17:34 ` Ojaswin Mujoo
2026-03-10 3:57 ` Darrick J. Wong
2026-03-09 17:34 ` [RFC 3/3] xfs: Add RWF_WRITETHROUGH support to xfs Ojaswin Mujoo
2 siblings, 1 reply; 11+ messages in thread
From: Ojaswin Mujoo @ 2026-03-09 17:34 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: djwong, john.g.garry, willy, hch, ritesh.list, jack,
Luis Chamberlain, dgc, tytso, p.raghav, andres, linux-kernel
Currently, RWF_WRITETHROUGH writes wait for writeback to complete
on a folio before performing the writethrough. This serializes
writethrough writes with each other and with the writeback path. However,
it is also desirable to have similar guarantees between RWF_WRITETHROUGH
and non-writethrough writes.
Hence, ensure stable writes are enabled on an inode's mapping as
long as a writethrough write is ongoing. This way, all paths will
wait for RWF_WRITETHROUGH to complete on a folio before proceeding.
To track inflight writethrough writes, we use an atomic counter in the
inode->i_mapping. This struct was chosen because (i) writethrough is an
operation on the folio and (ii) we don't want to add bloat to struct
inode.
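The ordering described above (set stable first, then bump the count; the
last decrement clears the flag) can be modelled in userspace with C11
atomics. This is a sketch of the intended ordering only, with a plain bool
standing in for the mapping's stable-writes flag:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Models i_wt_count driving mapping_set/clear_stable_writes() */
static atomic_int i_wt_count;
static atomic_bool stable_writes;

static void inode_writethrough_begin_model(void)
{
	/* Set stable first so it is enforced by the time the count is non-zero */
	atomic_store(&stable_writes, true);
	atomic_fetch_add(&i_wt_count, 1);
}

static void inode_writethrough_end_model(void)
{
	/* The last writethrough out clears the flag */
	if (atomic_fetch_sub(&i_wt_count, 1) == 1)
		atomic_store(&stable_writes, false);
}
```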
Suggested-by: Dave Chinner <dgc@kernel.org>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
fs/inode.c | 1 +
fs/iomap/buffered-io.c | 35 +++++++++++++++++++++++++++++++++--
fs/iomap/direct-io.c | 2 ++
include/linux/fs.h | 2 ++
include/linux/iomap.h | 2 ++
5 files changed, 40 insertions(+), 2 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index cc12b68e021b..5b779c112ff8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -280,6 +280,7 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
mapping->flags = 0;
mapping->wb_err = 0;
atomic_set(&mapping->i_mmap_writable, 0);
+ atomic_set(&mapping->i_wt_count, 0);
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
atomic_set(&mapping->nr_thps, 0);
#endif
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ab169daa1126..9d4d459af1a0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1150,11 +1150,41 @@ static bool iomap_writethrough_checks(struct kiocb *iocb, size_t off, loff_t len
return true;
}
+/**
+ * inode_writethrough_begin - signal start of a RWF_WRITETHROUGH request
+ * @inode: inode the writethrough happens on
+ *
+ * This is called when we are about to start a writethrough on an inode.
+ * If it is the first writethrough, set the mapping as stable to ensure
+ * other folio operations wait for writeback to finish.
+ *
+ * To avoid a race, just set the mapping stable first and then increment
+ * writethrough count, so that the stable writes are enforced as soon as
+ * writethrough count becomes non-zero.
+ */
+inline void inode_writethrough_begin(struct inode *inode)
+{
+ mapping_set_stable_writes(inode->i_mapping);
+ atomic_inc(&inode->i_mapping->i_wt_count);
+}
+
+/**
+ * inode_writethrough_end - signal finish of a RWF_WRITETHROUGH request
+ * @inode: inode the writethrough I/O happened on
+ *
+ * This is called once we've finished processing a writethrough request.
+ */
+inline void inode_writethrough_end(struct inode *inode)
+{
+ if (atomic_dec_and_test(&inode->i_mapping->i_wt_count))
+ mapping_clear_stable_writes(inode->i_mapping);
+}
+
/*
* With writethrough, we might potentially be writing through a partial
* folio hence we don't clear the dirty bit (yet)
*/
-static void folio_prepare_writethrough(struct folio *folio)
+static void folio_prepare_writethrough(struct inode *inode, struct folio *folio)
{
if (folio_test_writeback(folio))
folio_wait_writeback(folio);
@@ -1167,6 +1197,7 @@ static void folio_prepare_writethrough(struct folio *folio)
/* Refer folio_clear_dirty_for_io() for why this is needed */
folio_mark_dirty(folio);
+ inode_writethrough_begin(inode);
}
/**
@@ -1203,7 +1234,7 @@ static int iomap_writethrough_begin(struct kiocb *iocb, struct folio *folio,
bool fully_written;
u64 zero = 0;
- folio_prepare_writethrough(folio);
+ folio_prepare_writethrough(iter->inode, folio);
wt_ctx->bvec = kmalloc(sizeof(struct bio_vec), GFP_KERNEL | GFP_NOFS);
if (!wt_ctx->bvec)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f4d8ff08a83a..12680d97d765 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -140,6 +140,8 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
kiocb_invalidate_post_direct_write(iocb, dio->size);
inode_dio_end(file_inode(iocb->ki_filp));
+ if (dio->flags & IOMAP_DIO_BUF_WRITETHROUGH)
+ inode_writethrough_end(file_inode(iocb->ki_filp));
if (ret > 0) {
iocb->ki_pos += ret;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ca291957140e..6b7491fdd51a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -456,6 +456,7 @@ extern const struct address_space_operations empty_aops;
* memory mappings.
* @gfp_mask: Memory allocation flags to use for allocating pages.
* @i_mmap_writable: Number of VM_SHARED, VM_MAYWRITE mappings.
+ * @i_wt_count: Number of RWF_WRITETHROUGH writes ongoing in mapping.
* @nr_thps: Number of THPs in the pagecache (non-shmem only).
* @i_mmap: Tree of private and shared mappings.
* @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
@@ -474,6 +475,7 @@ struct address_space {
struct rw_semaphore invalidate_lock;
gfp_t gfp_mask;
atomic_t i_mmap_writable;
+ atomic_t i_wt_count;
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
/* number of thp, only for non-shmem files */
atomic_t nr_thps;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index b96574bb2918..6d08b966ceaf 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -630,6 +630,8 @@ struct iomap_writethrough_ops {
ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
const struct iomap_writethrough_ops *wt_ops,
void *private);
+inline void inode_writethrough_begin(struct inode *inode);
+inline void inode_writethrough_end(struct inode *inode);
#ifdef CONFIG_SWAP
struct file;
--
2.52.0
* [RFC 3/3] xfs: Add RWF_WRITETHROUGH support to xfs
2026-03-09 17:34 [RFC 0/3] Add buffered write-through support to iomap & xfs Ojaswin Mujoo
2026-03-09 17:34 ` [RFC 1/3] iomap: Support buffered RWF_WRITETHROUGH via async dio backend Ojaswin Mujoo
2026-03-09 17:34 ` [RFC 2/3] iomap: Enable stable writes for RWF_WRITETHROUGH inodes Ojaswin Mujoo
@ 2026-03-09 17:34 ` Ojaswin Mujoo
2 siblings, 0 replies; 11+ messages in thread
From: Ojaswin Mujoo @ 2026-03-09 17:34 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: djwong, john.g.garry, willy, hch, ritesh.list, jack,
Luis Chamberlain, dgc, tytso, p.raghav, andres, linux-kernel
Add the boilerplate needed to start supporting RWF_WRITETHROUGH in XFS.
We use the direct write ->iomap_begin() functions to ensure the range
under writeback always has a real non-delalloc extent. We reuse the
XFS dio end IO function to perform extent conversion and i_size handling
for us.
*Note on EOF edge case*
Buffered writethrough IO uses the dio path but allows non-block-aligned
writes; the IO we submit is rounded out to block size boundaries.
However, for end io processing, we must pass the original range to
xfs_dio_write_end_io(). This is important for non-block-aligned EOF
writes because otherwise XFS might update the i_size to more than what
the user originally wrote, exposing stale data.
Hence, add a wrapper over xfs_dio_write_end_io() to modify iocb->ki_pos
and the size of IO to correspond to the original range, so that our
extent conversion and i_size updates are correct.
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
fs/xfs/xfs_file.c | 68 ++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 64 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 6246f34df9fd..3eb868a2ba63 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -657,6 +657,55 @@ static const struct iomap_dio_ops xfs_dio_write_ops = {
.end_io = xfs_dio_write_end_io,
};
+/*
+ * *Note on EOF edge case*
+ *
+ * Buffered writethrough IO uses the dio path but allows non-block-aligned
+ * writes. The IO we submit is later rounded to the block size boundary.
+ * However, for end io processing, we must pass the original range to
+ * xfs_dio_write_end_io(). This is important for non-block-aligned EOF
+ * writes because otherwise XFS might update i_size beyond what the user
+ * originally wrote, exposing stale data.
+ *
+ * Hence, modify iocb->ki_pos and the size of IO to correspond to the original
+ * range, so that our extent conversion and i_size updates are correct.
+ */
+static int
+xfs_writethrough_end_io(
+ struct kiocb *iocb,
+ ssize_t size,
+ int error,
+ unsigned flags)
+{
+ struct iomap_writethrough_ctx *wt_ctx =
+ container_of(iocb, struct iomap_writethrough_ctx, iocb);
+ loff_t len = wt_ctx->orig_len;
+ loff_t end = iocb->ki_pos + size;
+ loff_t orig_end = wt_ctx->orig_pos + wt_ctx->orig_len;
+
+ /*
+ * We have a short write that didn't even cover the original range.
+ * Nothing to do
+ */
+ if (end <= wt_ctx->orig_pos)
+ return 0;
+
+ /*
+ * Short write partially covers original range. Trim the range to short
+ * write's end.
+ */
+ if (end < orig_end)
+ len = end - wt_ctx->orig_pos;
+
+ iocb->ki_pos = wt_ctx->orig_pos;
+
+ return xfs_dio_write_end_io(iocb, len, error, flags);
+}
+
+static const struct iomap_dio_ops xfs_dio_writethrough_ops = {
+ .end_io = xfs_writethrough_end_io,
+};
+
static void
xfs_dio_zoned_submit_io(
const struct iomap_iter *iter,
@@ -988,6 +1037,13 @@ xfs_file_dax_write(
return ret;
}
+const struct iomap_writethrough_ops xfs_writethrough_ops = {
+ .ops = &xfs_direct_write_iomap_ops,
+ .write_ops = &xfs_iomap_write_ops,
+ .dio_ops = &xfs_dio_writethrough_ops,
+};
+
+
STATIC ssize_t
xfs_file_buffered_write(
struct kiocb *iocb,
@@ -1010,9 +1066,13 @@ xfs_file_buffered_write(
goto out;
trace_xfs_file_buffered_write(iocb, from);
- ret = iomap_file_buffered_write(iocb, from,
- &xfs_buffered_write_iomap_ops, &xfs_iomap_write_ops,
- NULL);
+ if (iocb->ki_flags & IOCB_WRITETHROUGH) {
+ ret = iomap_file_writethrough_write(iocb, from,
+ &xfs_writethrough_ops, NULL);
+ } else
+ ret = iomap_file_buffered_write(iocb, from,
+ &xfs_buffered_write_iomap_ops,
+ &xfs_iomap_write_ops, NULL);
/*
* If we hit a space limit, try to free up some lingering preallocated
@@ -2042,7 +2102,7 @@ const struct file_operations xfs_file_operations = {
.remap_file_range = xfs_file_remap_range,
.fop_flags = FOP_MMAP_SYNC | FOP_BUFFER_RASYNC |
FOP_BUFFER_WASYNC | FOP_DIO_PARALLEL_WRITE |
- FOP_DONTCACHE,
+ FOP_DONTCACHE | FOP_WRITETHROUGH,
.setlease = generic_setlease,
};
--
2.52.0
* Re: [RFC 2/3] iomap: Enable stable writes for RWF_WRITETHROUGH inodes
2026-03-09 17:34 ` [RFC 2/3] iomap: Enable stable writes for RWF_WRITETHROUGH inodes Ojaswin Mujoo
@ 2026-03-10 3:57 ` Darrick J. Wong
2026-03-10 5:25 ` Ritesh Harjani
0 siblings, 1 reply; 11+ messages in thread
From: Darrick J. Wong @ 2026-03-10 3:57 UTC (permalink / raw)
To: Ojaswin Mujoo
Cc: linux-xfs, linux-fsdevel, john.g.garry, willy, hch, ritesh.list,
jack, Luis Chamberlain, dgc, tytso, p.raghav, andres,
linux-kernel
On Mon, Mar 09, 2026 at 11:04:32PM +0530, Ojaswin Mujoo wrote:
> Currently, RWF_WRITETHROUGH writes wait for writeback to complete
> on a folio before performing the writethrough. This serializes
> writethrough writes with each other and with the writeback path. However,
> it is also desirable to have similar guarantees between RWF_WRITETHROUGH
> and non-writethrough writes.
>
> Hence, ensure stable writes are enabled on an inode's mapping as
> long as a writethrough write is ongoing. This way, all paths will
> wait for RWF_WRITETHROUGH to complete on a folio before proceeding.
>
> To track inflight writethrough writes, we use an atomic counter in the
> inode->i_mapping. This struct was chosen because (i) writethrough is an
> operation on the folio and (ii) we don't want to add bloat to struct
> inode.
What if we just set it whenever someone successfully initiates a
RWF_WRITETHROUGH write? Then we wouldn't need all this atomic counter
machinery.
Also: What if some filesystem (not xfs, obviously) finds a need to
change the stablepages bit while there might be writethrough writes in
progress? It's a little awkward to have a flag /and/ a counter; why not
change mapping_{set,clear}_stable_pages to inc and dec the counter and
base the test off that?
--D
> Suggested-by: Dave Chinner <dgc@kernel.org>
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
> fs/inode.c | 1 +
> fs/iomap/buffered-io.c | 35 +++++++++++++++++++++++++++++++++--
> fs/iomap/direct-io.c | 2 ++
> include/linux/fs.h | 2 ++
> include/linux/iomap.h | 2 ++
> 5 files changed, 40 insertions(+), 2 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index cc12b68e021b..5b779c112ff8 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -280,6 +280,7 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
> mapping->flags = 0;
> mapping->wb_err = 0;
> atomic_set(&mapping->i_mmap_writable, 0);
> + atomic_set(&mapping->i_wt_count, 0);
> #ifdef CONFIG_READ_ONLY_THP_FOR_FS
> atomic_set(&mapping->nr_thps, 0);
> #endif
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index ab169daa1126..9d4d459af1a0 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1150,11 +1150,41 @@ static bool iomap_writethrough_checks(struct kiocb *iocb, size_t off, loff_t len
> return true;
> }
>
> +/**
> + * inode_writethrough_begin - signal start of a RWF_WRITETHROUGH request
> + * @inode: inode the writethrough happens on
> + *
> + * This is called when we are about to start a writethrough on an inode.
> + * If it is the first writethrough, set the mapping as stable to ensure
> + * other folio operations wait for writeback to finish.
> + *
> + * To avoid a race, just set the mapping stable first and then increment
> + * writethrough count, so that the stable writes are enforced as soon as
> + * writethrough count becomes non zero.
> + */
> +inline void inode_writethrough_begin(struct inode *inode)
> +{
> + mapping_set_stable_writes(inode->i_mapping);
> + atomic_inc(&inode->i_mapping->i_wt_count);
> +}
> +
> +/**
> + * inode_writethrough_end - signal finish of a RWF_WRITETHROUGH request
> + * @inode: inode the writethrough I/O happened on
> + *
> + * This is called once we've finished processing a writethrough request
> + */
> +inline void inode_writethrough_end(struct inode *inode)
> +{
> + if (atomic_dec_and_test(&inode->i_mapping->i_wt_count))
> + mapping_clear_stable_writes(inode->i_mapping);
> +}
> +
> /*
> * With writethrough, we might potentially be writing through a partial
> * folio hence we don't clear the dirty bit (yet)
> */
> -static void folio_prepare_writethrough(struct folio *folio)
> +static void folio_prepare_writethrough(struct inode *inode, struct folio *folio)
> {
> if (folio_test_writeback(folio))
> folio_wait_writeback(folio);
> @@ -1167,6 +1197,7 @@ static void folio_prepare_writethrough(struct folio *folio)
> /* Refer folio_clear_dirty_for_io() for why this is needed */
> folio_mark_dirty(folio);
>
> + inode_writethrough_begin(inode);
> }
>
> /**
> @@ -1203,7 +1234,7 @@ static int iomap_writethrough_begin(struct kiocb *iocb, struct folio *folio,
> bool fully_written;
> u64 zero = 0;
>
> - folio_prepare_writethrough(folio);
> + folio_prepare_writethrough(iter->inode, folio);
>
> wt_ctx->bvec = kmalloc(sizeof(struct bio_vec), GFP_KERNEL | GFP_NOFS);
> if (!wt_ctx->bvec)
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f4d8ff08a83a..12680d97d765 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -140,6 +140,8 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> kiocb_invalidate_post_direct_write(iocb, dio->size);
>
> inode_dio_end(file_inode(iocb->ki_filp));
> + if (dio->flags & IOMAP_DIO_BUF_WRITETHROUGH)
> + inode_writethrough_end(file_inode(iocb->ki_filp));
>
> if (ret > 0) {
> iocb->ki_pos += ret;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ca291957140e..6b7491fdd51a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -456,6 +456,7 @@ extern const struct address_space_operations empty_aops;
> * memory mappings.
> * @gfp_mask: Memory allocation flags to use for allocating pages.
> * @i_mmap_writable: Number of VM_SHARED, VM_MAYWRITE mappings.
> + * @i_wt_count: Number of RWF_WRITETHROUGH writes ongoing in mapping.
> * @nr_thps: Number of THPs in the pagecache (non-shmem only).
> * @i_mmap: Tree of private and shared mappings.
> * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
> @@ -474,6 +475,7 @@ struct address_space {
> struct rw_semaphore invalidate_lock;
> gfp_t gfp_mask;
> atomic_t i_mmap_writable;
> + atomic_t i_wt_count;
> #ifdef CONFIG_READ_ONLY_THP_FOR_FS
> /* number of thp, only for non-shmem files */
> atomic_t nr_thps;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index b96574bb2918..6d08b966ceaf 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -630,6 +630,8 @@ struct iomap_writethrough_ops {
> ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
> const struct iomap_writethrough_ops *wt_ops,
> void *private);
> +inline void inode_writethrough_begin(struct inode *inode);
> +inline void inode_writethrough_end(struct inode *inode);
>
> #ifdef CONFIG_SWAP
> struct file;
> --
> 2.52.0
>
>
* Re: [RFC 2/3] iomap: Enable stable writes for RWF_WRITETHROUGH inodes
2026-03-10 3:57 ` Darrick J. Wong
@ 2026-03-10 5:25 ` Ritesh Harjani
2026-03-11 6:27 ` Ojaswin Mujoo
0 siblings, 1 reply; 11+ messages in thread
From: Ritesh Harjani @ 2026-03-10 5:25 UTC (permalink / raw)
To: Darrick J. Wong, Ojaswin Mujoo
Cc: linux-xfs, linux-fsdevel, john.g.garry, willy, hch, jack,
Luis Chamberlain, dgc, tytso, p.raghav, andres, linux-kernel
"Darrick J. Wong" <djwong@kernel.org> writes:
> On Mon, Mar 09, 2026 at 11:04:32PM +0530, Ojaswin Mujoo wrote:
>> Currently, RWF_WRITETHROUGH writes wait for writeback to complete
>> on a folio before performing the writethrough. This serializes
>> writethrough writes with each other and with the writeback path. However,
>> it is also desirable to have similar guarantees between RWF_WRITETHROUGH
>> and non-writethrough writes.
>>
>> Hence, ensure stable writes are enabled on an inode's mapping as
>> long as a writethrough write is ongoing. This way, all paths will
>> wait for RWF_WRITETHROUGH to complete on a folio before proceeding.
>>
>> To track inflight writethrough writes, we use an atomic counter in the
>> inode->i_mapping. This struct was chosen because (i) writethrough is an
>> operation on the folio and (ii) we don't want to add bloat to struct
>> inode.
Now I am also questioning the need of this counter.
If mapping has AS_STABLE_WRITES bit set, then that means the
inode->mapping is going through stable writes until that bit is
cleared. And since, in future, we are going to add support for async
buffered write-through, the stable-writes bit should get cleared in
the completion path (like it is done now).
>
> What if we just set it whenever someone successfully initiates a
> RWF_WRITETHROUGH write? Then we wouldn't need all this atomic counter
> machinery.
>
I agree. If we set the mapping as stable before initiating
iomap_write_begin() itself, then we don't need this atomic counter.
Maybe, we can set it in iomap_file_writethrough_write() itself
(we have mapping available from iocb).
> Also: What if some filesystem (not xfs, obviously) finds a need to
> change the stablepages bit while there might be writethrough writes in
> progress?
Is there a usecase where this can happen (just curious)?
> It's a little awkward to have a flag /and/ a counter; why not
> change mapping_{set,clear}_stable_pages to inc and dec the counter and
> base the test off that?
>
Yes, either way, I agree that I don't see the need of an extra counter here.
-ritesh
* Re: [RFC 1/3] iomap: Support buffered RWF_WRITETHROUGH via async dio backend
2026-03-09 17:34 ` [RFC 1/3] iomap: Support buffered RWF_WRITETHROUGH via async dio backend Ojaswin Mujoo
@ 2026-03-10 6:48 ` Dave Chinner
2026-03-11 10:35 ` Ojaswin Mujoo
0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2026-03-10 6:48 UTC (permalink / raw)
To: Ojaswin Mujoo
Cc: linux-xfs, linux-fsdevel, djwong, john.g.garry, willy, hch,
ritesh.list, jack, Luis Chamberlain, tytso, p.raghav, andres,
linux-kernel
On Mon, Mar 09, 2026 at 11:04:31PM +0530, Ojaswin Mujoo wrote:
> +/**
> + * iomap_writethrough_iter - perform RWF_WRITETHROUGH buffered write
> + * @iocb: kernel iocb struct
> + * @iter: iomap iter holding mapping information
> + * @i: iov_iter for write
> + * @wt_ops: the fs callbacks needed for writethrough
> + *
> + * This function copies the user buffer to folio similar to usual buffered
> + * IO path, with the difference that we immediately issue the IO. For this we
> + * utilize the async dio machinery. While issuing the async IO, we need to be
> + * careful to clone the iocb so that it doesn't get destroyed underneath us
> + * in case the syscall exits before endio() is triggered.
> + *
> + * Folio handling note: We might be writing through a partial folio so we need
> + * to be careful to not clear the folio dirty bit unless there are no dirty blocks
> + * in the folio after the writethrough.
> + *
> + * TODO: Filesystem freezing during ongoing writethrough writes is currently
> + * buggy. We call file_start_write() once before taking any lock but we can't
> + * just simply call the corresponding file_end_write() in endio because single
> + * RWF_WRITETHROUGH might be split into many IOs leading to multiple endio()
> + * calls. Currently we are looking into the right way to synchronise with
> + * freeze_super().
> + */
> +static int iomap_writethrough_iter(struct kiocb *iocb, struct iomap_iter *iter,
> + struct iov_iter *i,
> + const struct iomap_writethrough_ops *wt_ops)
> +{
> + ssize_t total_written = 0;
> + int status = 0;
> + struct address_space *mapping = iter->inode->i_mapping;
> + size_t chunk = mapping_max_folio_size(mapping);
> + unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0;
> +
> + if (!(iter->flags & IOMAP_WRITETHROUGH))
> + return -EINVAL;
> +
> + do {
......
> + status = iomap_write_begin(iter, wt_ops->write_ops, &folio,
> + &offset, &bytes);
> + if (unlikely(status)) {
> + iomap_write_failed(iter->inode, iter->pos, bytes);
> + break;
> + }
> + if (iter->iomap.flags & IOMAP_F_STALE)
> + break;
> +
> + pos = iter->pos;
> +
> + if (mapping_writably_mapped(mapping))
> + flush_dcache_folio(folio);
> +
> + copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
> + written = iomap_write_end(iter, bytes, copied, folio) ?
> + copied : 0;
......
> + /*
> + * The copy-to-folio operation succeeded. Lets use the dio
> + * machinery to send the writethrough IO.
> + */
> + if (written) {
> + struct iomap_writethrough_ctx *wt_ctx;
> + int dio_flags = IOMAP_DIO_BUF_WRITETHROUGH;
> + struct iov_iter iov_wt;
> +
> + wt_ctx = kzalloc(sizeof(struct iomap_writethrough_ctx),
> + GFP_KERNEL | GFP_NOFS);
> + if (!wt_ctx) {
> + status = -ENOMEM;
> + written = 0;
> + goto put_folio;
> + }
> +
> + status = iomap_writethrough_begin(iocb, folio, iter,
> + wt_ctx, &iov_wt,
> + offset, written);
> + if (status < 0) {
> + if (status != -ENOMEM)
> + noretry = true;
> + written = 0;
> + kfree(wt_ctx);
> + goto put_folio;
> + }
> +
> + /* Don't retry for any failures in writethrough */
> + noretry = true;
> +
> + status = iomap_dio_rw(&wt_ctx->iocb, &iov_wt,
> + wt_ops->ops, wt_ops->dio_ops,
> + dio_flags, NULL, 0);
.....
This is not what I envisaged write-through using DIO to look like.
This is a DIO per folio, rather than a DIO per write() syscall. We
want the latter to be the common case, not the former, especially
when it comes to RWF_ATOMIC support.
i.e. I was expecting something more like having a wt context
allocated up front with an appropriately sized bvec appended to it
(i.e. single allocation for the common case). Then in
iomap_write_end(), we'd mark the folio as under writeback and add it
to the bvec. Then we iterate through the IO range adding folio after
folio to the bvec.
When the bvec is full or we reach the end of the IO, we then push
that bvec down to the DIO code. Ideally we'd also push the iomap we
already hold down as well, so that the DIO code does not need to
look it up again (unless the mapping is stale). The DIO completion
callback then runs a completion callback that iterates the folios
attached to the bvec and runs buffered writeback completion on them.
It then decrements the wt-ctx IO-in-flight counter.
If there is more user data to submit, we keep going around (with a
new bvec if we need it) adding folios and submitting them to the dio
code until there is no more data to copy in and submit.
The writethrough context then drops its own "in-flight" reference
and waits for the in-flight counter to go to zero.
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index c24d94349ca5..f4d8ff08a83a 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -713,7 +713,8 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> dio->i_size = i_size_read(inode);
> dio->dops = dops;
> dio->error = 0;
> - dio->flags = dio_flags & (IOMAP_DIO_FSBLOCK_ALIGNED | IOMAP_DIO_BOUNCE);
> + dio->flags = dio_flags & (IOMAP_DIO_FSBLOCK_ALIGNED | IOMAP_DIO_BOUNCE |
> + IOMAP_DIO_BUF_WRITETHROUGH);
> dio->done_before = done_before;
>
> dio->submit.iter = iter;
> @@ -747,8 +748,13 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> if (iocb->ki_flags & IOCB_ATOMIC)
> iomi.flags |= IOMAP_ATOMIC;
>
> - /* for data sync or sync, we need sync completion processing */
> - if (iocb_is_dsync(iocb)) {
> + /*
> + * for data sync or sync, we need sync completion processing.
> + * for buffered writethrough, sync is handled in buffered IO
> + * path so not needed here
> + */
> + if (iocb_is_dsync(iocb) &&
> + !(dio->flags & IOMAP_DIO_BUF_WRITETHROUGH)) {
> dio->flags |= IOMAP_DIO_NEED_SYNC;
Ah, that looks wrong. We want writethrough to be able to use FUA
optimisations for RWF_DSYNC. This prevents the DIO write for wt from
setting IOMAP_DIO_WRITE_THROUGH which is needed to trigger FUA
writes for RWF_DSYNC ops.
i.e. we need DIO to handle the write completions directly to allow
conditional calling of generic_write_sync() based on whether FUA
writes were used or not.
> @@ -765,35 +771,47 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> }
>
> /*
> - * i_size updates must to happen from process context.
> + * i_size updates must happen from process context. For
> + * buffered writethrough, the caller might have already changed the
> + * i_size but still needs endio i_size handling. We can't detect
> + * this here so just use process context unconditionally.
> */
> - if (iomi.pos + iomi.len > dio->i_size)
> + if ((iomi.pos + iomi.len > dio->i_size) ||
> + dio_flags & IOMAP_DIO_BUF_WRITETHROUGH)
> dio->flags |= IOMAP_DIO_COMP_WORK;
This is only true because you called i_size_write() in
iomap_writethrough_iter() before calling down into the DIO code.
We should only update i_size on completion of the write-through IO,
not before we've submitted the IO.
i.e. i_size_write() should only be called by the IO completion that
drops the wt-ctx in-flight counter to zero. i.e. i_size should not
change until the entire IO is complete, it should not be updated
after each folio has the data copied into it.
> /*
> * Try to invalidate cache pages for the range we are writing.
> * If this invalidation fails, let the caller fall back to
> * buffered I/O.
> + *
> + * The exception is if we are using the dio path for buffered
> + * RWF_WRITETHROUGH, in which case we cannot invalidate the pages
> + * as we are writing them through and already hold their
> + * folio_lock. For the same reason, disable end of write invalidation
> */
> - ret = kiocb_invalidate_pages(iocb, iomi.len);
> - if (ret) {
> - if (ret != -EAGAIN) {
> - trace_iomap_dio_invalidate_fail(inode, iomi.pos,
> - iomi.len);
> - if (iocb->ki_flags & IOCB_ATOMIC) {
> - /*
> - * folio invalidation failed, maybe
> - * this is transient, unlock and see if
> - * the caller tries again.
> - */
> - ret = -EAGAIN;
> - } else {
> - /* fall back to buffered write */
> - ret = -ENOTBLK;
> + if (!(dio_flags & IOMAP_DIO_BUF_WRITETHROUGH)) {
> + ret = kiocb_invalidate_pages(iocb, iomi.len);
> + if (ret) {
> + if (ret != -EAGAIN) {
> + trace_iomap_dio_invalidate_fail(inode, iomi.pos,
> + iomi.len);
> + if (iocb->ki_flags & IOCB_ATOMIC) {
> + /*
> + * folio invalidation failed, maybe
> + * this is transient, unlock and see if
> + * the caller tries again.
> + */
> + ret = -EAGAIN;
> + } else {
> + /* fall back to buffered write */
> + ret = -ENOTBLK;
> + }
> }
> + goto out_free_dio;
> }
> - goto out_free_dio;
> - }
> + } else
> + dio->flags |= IOMAP_DIO_NO_INVALIDATE;
> }
Waaaay too much indent. It is time to start factoring
__iomap_dio_rw() - it is turning into spaghetti with all these
conditional behaviours.
>
> if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 8b3dd145b25e..ca291957140e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -346,6 +346,7 @@ struct readahead_control;
> #define IOCB_ATOMIC (__force int) RWF_ATOMIC
> #define IOCB_DONTCACHE (__force int) RWF_DONTCACHE
> #define IOCB_NOSIGNAL (__force int) RWF_NOSIGNAL
> +#define IOCB_WRITETHROUGH (__force int) RWF_WRITETHROUGH
>
> /* non-RWF related bits - start at 16 */
> #define IOCB_EVENTFD (1 << 16)
> @@ -1985,6 +1986,8 @@ struct file_operations {
> #define FOP_ASYNC_LOCK ((__force fop_flags_t)(1 << 6))
> /* File system supports uncached read/write buffered IO */
> #define FOP_DONTCACHE ((__force fop_flags_t)(1 << 7))
> +/* File system supports write through buffered IO */
> +#define FOP_WRITETHROUGH ((__force fop_flags_t)(1 << 8))
>
> /* Wrap a directory iterator that needs exclusive inode access */
> int wrap_directory_iterator(struct file *, struct dir_context *,
> @@ -3436,6 +3439,10 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
> if (IS_DAX(ki->ki_filp->f_mapping->host))
> return -EOPNOTSUPP;
> }
> + if (flags & RWF_WRITETHROUGH)
> + /* file system must support it */
> + if (!(ki->ki_filp->f_op->fop_flags & FOP_WRITETHROUGH))
> + return -EOPNOTSUPP;
Needs {}
> kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
> if (flags & RWF_SYNC)
> kiocb_flags |= IOCB_DSYNC;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 531f9ebdeeae..b96574bb2918 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -209,6 +209,7 @@ struct iomap_write_ops {
> #endif /* CONFIG_FS_DAX */
> #define IOMAP_ATOMIC (1 << 9) /* torn-write protection */
> #define IOMAP_DONTCACHE (1 << 10)
> +#define IOMAP_WRITETHROUGH (1 << 11)
>
> struct iomap_ops {
> /*
> @@ -475,6 +476,15 @@ struct iomap_writepage_ctx {
> void *wb_ctx; /* pending writeback context */
> };
>
> +struct iomap_writethrough_ctx {
> + struct kiocb iocb;
> + struct folio *folio;
> + struct inode *inode;
> + struct bio_vec *bvec;
> + loff_t orig_pos;
> + loff_t orig_len;
> +};
> +
> struct iomap_ioend *iomap_init_ioend(struct inode *inode, struct bio *bio,
> loff_t file_offset, u16 ioend_flags);
> struct iomap_ioend *iomap_split_ioend(struct iomap_ioend *ioend,
> @@ -590,6 +600,14 @@ struct iomap_dio_ops {
> */
> #define IOMAP_DIO_BOUNCE (1 << 4)
>
> +/*
> + * Set when we are using the dio path to perform writethrough for
> + * RWF_WRITETHROUGH buffered write. The ->endio handler must check this
> + * to perform any writethrough related cleanup like ending writeback on
> + * a folio.
> + */
> +#define IOMAP_DIO_BUF_WRITETHROUGH (1 << 5)
I suspect that iomap should provide the dio ->endio handler itself
to run the higher level buffered IO completion handling. i.e. we
have callbacks for custom endio handling, I'm not sure that we need
logic flags for that...
-Dave.
--
Dave Chinner
dgc@kernel.org
* Re: [RFC 2/3] iomap: Enable stable writes for RWF_WRITETHROUGH inodes
2026-03-10 5:25 ` Ritesh Harjani
@ 2026-03-11 6:27 ` Ojaswin Mujoo
0 siblings, 0 replies; 11+ messages in thread
From: Ojaswin Mujoo @ 2026-03-11 6:27 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Darrick J. Wong, linux-xfs, linux-fsdevel, john.g.garry, willy,
hch, jack, Luis Chamberlain, dgc, tytso, p.raghav, andres,
linux-kernel
On Tue, Mar 10, 2026 at 10:55:04AM +0530, Ritesh Harjani wrote:
> "Darrick J. Wong" <djwong@kernel.org> writes:
>
> > On Mon, Mar 09, 2026 at 11:04:32PM +0530, Ojaswin Mujoo wrote:
> >> Currently, RWF_WRITETHROUGH writes wait for writeback to complete
> >> on a folio before performing the writethrough. This serializes
> >> writethrough writes with each other and with the writeback path. However,
> >> it is also desirable to have similar guarantees between RWF_WRITETHROUGH
> >> and non-writethrough writes.
> >>
> >> Hence, ensure stable writes are enabled on an inode's mapping as
> >> long as a writethrough write is ongoing. This way, all paths will
> >> wait for RWF_WRITETHROUGH to complete on a folio before proceeding.
> >>
> >> To track inflight writethrough writes, we use an atomic counter in the
> >> inode->i_mapping. This struct was chosen because (i) writethrough is an
> >> operation on the folio and (ii) we don't want to add bloat to struct
> >> inode.
>
> Now I am also questioning the need of this counter.
> If mapping has AS_STABLE_WRITES bit set, then that means the
> inode->mapping is going through stable writes until that bit is
> cleared. And since, in future, we are going to add support for async
> buffered write-through, the stable-writes bit should get cleared in
> the completion path (like how it is done now.)
>
> >
> > What if we just set it whenever someone successfully initiates a
> > RWF_WRITETHROUGH write? Then we wouldn't need all this atomic counter
> > machinery.
> >
>
> I agree. If we set the mapping as stable before initiating
> iomap_write_begin() itself, then we don't need this atomic counter.
>
> Maybe, we can set it in iomap_file_writethrough_write() itself
> (we have mapping available from iocb).
Hi Darrick, Ritesh,
Yes, I think we don't need the counter to know when to switch stable
writes on and off. Now that I'm thinking about it, maybe mapping-level
stable writes are too restrictive? I understand that for certain hardware we
need it at the mapping level, but for cases like writethrough, all we need is
for that particular folio to complete writeback. Why should we serialize
it with other non-overlapping writes?
Maybe implementing folio-level stable writes or sprinkling around
folio_wait_writeback() makes more sense?
Also since we are on this topic, another thing that I should change is
where we call folio_mkclean(). Right now we call folio_mkclean() after
copying user data to the pagecache, which means there's a window where an mmap
write might change the data. I think we should proactively call it
before the memcpy?
>
>
> > Also: What if some filesystem (not xfs, obviously) finds a need to
> > change the stablepages bit while there might be writethrough writes in
> > progress?
>
> Is there a usecase where this can happen (just curious)?
>
> > It's a little awkward to have a flag /and/ a counter; why not
> > change mapping_{set,clear}_stable_pages to inc and dec the counter and
> > base the test off that?
> >
>
> Yes, either way, I agree that I don't see the need of an extra counter here.
>
> -ritesh
* Re: [RFC 1/3] iomap: Support buffered RWF_WRITETHROUGH via async dio backend
2026-03-10 6:48 ` Dave Chinner
@ 2026-03-11 10:35 ` Ojaswin Mujoo
2026-03-11 12:05 ` Dave Chinner
0 siblings, 1 reply; 11+ messages in thread
From: Ojaswin Mujoo @ 2026-03-11 10:35 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-xfs, linux-fsdevel, djwong, john.g.garry, willy, hch,
ritesh.list, jack, Luis Chamberlain, tytso, p.raghav, andres,
linux-kernel
On Tue, Mar 10, 2026 at 05:48:12PM +1100, Dave Chinner wrote:
> On Mon, Mar 09, 2026 at 11:04:31PM +0530, Ojaswin Mujoo wrote:
> > +/**
> > + * iomap_writethrough_iter - perform RWF_WRITETHROUGH buffered write
> > + * @iocb: kernel iocb struct
> > + * @iter: iomap iter holding mapping information
> > + * @i: iov_iter for write
> > + * @wt_ops: the fs callbacks needed for writethrough
> > + *
> > + * This function copies the user buffer to folio similar to usual buffered
> > + * IO path, with the difference that we immediately issue the IO. For this we
> > + * utilize the async dio machinery. While issuing the async IO, we need to be
> > + * careful to clone the iocb so that it doesn't get destroyed underneath us
> > + * in case the syscall exits before endio() is triggered.
> > + *
> > + * Folio handling note: We might be writing through a partial folio so we need
> > + * to be careful to not clear the folio dirty bit unless there are no dirty blocks
> > + * in the folio after the writethrough.
> > + *
> > + * TODO: Filesystem freezing during ongoing writethrough writes is currently
> > + * buggy. We call file_start_write() once before taking any lock but we can't
> > + * just simply call the corresponding file_end_write() in endio because single
> > + * RWF_WRITETHROUGH might be split into many IOs leading to multiple endio()
> > + * calls. Currently we are looking into the right way to synchronise with
> > + * freeze_super().
> > + */
> > +static int iomap_writethrough_iter(struct kiocb *iocb, struct iomap_iter *iter,
> > + struct iov_iter *i,
> > + const struct iomap_writethrough_ops *wt_ops)
> > +{
> > + ssize_t total_written = 0;
> > + int status = 0;
> > + struct address_space *mapping = iter->inode->i_mapping;
> > + size_t chunk = mapping_max_folio_size(mapping);
> > + unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0;
> > +
> > + if (!(iter->flags & IOMAP_WRITETHROUGH))
> > + return -EINVAL;
> > +
> > + do {
> ......
> > + status = iomap_write_begin(iter, wt_ops->write_ops, &folio,
> > + &offset, &bytes);
> > + if (unlikely(status)) {
> > + iomap_write_failed(iter->inode, iter->pos, bytes);
> > + break;
> > + }
> > + if (iter->iomap.flags & IOMAP_F_STALE)
> > + break;
> > +
> > + pos = iter->pos;
> > +
> > + if (mapping_writably_mapped(mapping))
> > + flush_dcache_folio(folio);
> > +
> > + copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
> > + written = iomap_write_end(iter, bytes, copied, folio) ?
> > + copied : 0;
> ......
> > + /*
> > + * The copy-to-folio operation succeeded. Lets use the dio
> > + * machinery to send the writethrough IO.
> > + */
> > + if (written) {
> > + struct iomap_writethrough_ctx *wt_ctx;
> > + int dio_flags = IOMAP_DIO_BUF_WRITETHROUGH;
> > + struct iov_iter iov_wt;
> > +
> > + wt_ctx = kzalloc(sizeof(struct iomap_writethrough_ctx),
> > + GFP_KERNEL | GFP_NOFS);
> > + if (!wt_ctx) {
> > + status = -ENOMEM;
> > + written = 0;
> > + goto put_folio;
> > + }
> > +
> > + status = iomap_writethrough_begin(iocb, folio, iter,
> > + wt_ctx, &iov_wt,
> > + offset, written);
> > + if (status < 0) {
> > + if (status != -ENOMEM)
> > + noretry = true;
> > + written = 0;
> > + kfree(wt_ctx);
> > + goto put_folio;
> > + }
> > +
> > +			/* Don't retry for any failures in writethrough */
> > + noretry = true;
> > +
> > + status = iomap_dio_rw(&wt_ctx->iocb, &iov_wt,
> > + wt_ops->ops, wt_ops->dio_ops,
> > + dio_flags, NULL, 0);
> .....
>
> This is not what I envisaged write-through using DIO to look like.
> This is a DIO per folio, rather than a DIO per write() syscall. We
> want the latter to be the common case, not the former, especially
> when it comes to RWF_ATOMIC support.
>
> i.e. I was expecting something more like having a wt context
> allocated up front with an appropriately sized bvec appended to it
> (i.e. single allocation for the common case). Then in
> iomap_write_end(), we'd mark the folio as under writeback and add it
> to the bvec. Then we iterate through the IO range adding folio after
> folio to the bvec.
>
> When the bvec is full or we reach the end of the IO, we then push
> that bvec down to the DIO code. Ideally we'd also push the iomap we
> already hold down as well, so that the DIO code does not need to
> look it up again (unless the mapping is stale). The DIO completion
> callback then runs a completion callback that iterates the folios
> attached to the bvec and runs buffered writeback completion on them.
> It can then decrement the wt-ctx IO-in-flight counter.
>
> If there is more user data to submit, we keep going around (with a
> new bvec if we need it) adding folios and submitting them to the dio
> code until there is no more data to copy in and submit.
>
> The writethrough context then drops its own "in-flight" reference
> and waits for the in-flight counter to go to zero.
Hi Dave,
Thanks for the review. IIUC you are suggesting a per-iomap submission of
dio rather than per-folio, and for each iomap we submit we can
maintain a per-writethrough counter that helps us perform any
endio cleanup work. I can give this design a try in v2.
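For my own notes, here is a rough pseudocode sketch of how I understand
the suggested flow; all structs and helpers below are hypothetical, not
existing iomap API:

```c
/*
 * Pseudocode sketch only -- names and signatures are made up for
 * illustration, not existing iomap API.
 */
struct iomap_wt_ctx {
	struct kiocb	*iocb;
	atomic_t	in_flight;	/* submitted DIOs + submitter ref */
	int		error;
	struct bio_vec	*bvec;		/* folios queued for the next DIO */
	unsigned int	nr_bvecs;
};

/* in the write loop: after copying in data, queue the folio */
static void iomap_wt_queue_folio(struct iomap_wt_ctx *ctx,
				 struct folio *folio, size_t off, size_t len)
{
	folio_start_writeback(folio);
	bvec_set_folio(&ctx->bvec[ctx->nr_bvecs++], folio, len, off);
}

/* bvec full, or end of the IO range: one DIO per mapped extent */
static void iomap_wt_submit(struct iomap_wt_ctx *ctx,
			    const struct iomap *iomap)
{
	atomic_inc(&ctx->in_flight);
	/* build an ITER_BVEC iter over ctx->bvec and push it, together
	 * with the iomap we already hold, down to the dio code */
	ctx->nr_bvecs = 0;
}
```

i.e. the wt ctx is allocated once up front, folios accumulate in the
bvec, and DIO submissions are per extent rather than per folio.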
>
>
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index c24d94349ca5..f4d8ff08a83a 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -713,7 +713,8 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > dio->i_size = i_size_read(inode);
> > dio->dops = dops;
> > dio->error = 0;
> > - dio->flags = dio_flags & (IOMAP_DIO_FSBLOCK_ALIGNED | IOMAP_DIO_BOUNCE);
> > + dio->flags = dio_flags & (IOMAP_DIO_FSBLOCK_ALIGNED | IOMAP_DIO_BOUNCE |
> > + IOMAP_DIO_BUF_WRITETHROUGH);
> > dio->done_before = done_before;
> >
> > dio->submit.iter = iter;
> > @@ -747,8 +748,13 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > if (iocb->ki_flags & IOCB_ATOMIC)
> > iomi.flags |= IOMAP_ATOMIC;
> >
> > - /* for data sync or sync, we need sync completion processing */
> > - if (iocb_is_dsync(iocb)) {
> > + /*
> > + * for data sync or sync, we need sync completion processing.
> > + * for buffered writethrough, sync is handled in buffered IO
> > + * path so not needed here
> > + */
> > + if (iocb_is_dsync(iocb) &&
> > + !(dio->flags & IOMAP_DIO_BUF_WRITETHROUGH)) {
> > dio->flags |= IOMAP_DIO_NEED_SYNC;
>
> Ah, that looks wrong. We want writethrough to be able to use FUA
> optimisations for RWF_DSYNC. This prevents the DIO write for wt from
> setting IOMAP_DIO_WRITE_THROUGH which is needed to trigger FUA
> writes for RWF_DSYNC ops.
>
> i.e. we need DIO to handle the write completions directly to allow
> conditional calling of generic_write_sync() based on whether FUA
> writes were used or not.
Yes right, for now we just let xfs_file_buffered_write() ->
generic_write_sync() handle the sync, because first we wanted to have
some discussion on how to correctly implement optimized
O_DSYNC/RWF_DSYNC.
Some open questions that I have right now:
1. For non-aio non-FUA writethrough, where is the right place to do the
sync? We can't simply rely on iomap_dio_complete() to do the sync
since we still hold the writeback bit and that causes a deadlock. Even if
we solve that, we need a way to propagate any fsync errors
back to the user, so endio might not be the right place anyway?
2. For non-aio writethrough, if we do want to do the sync via
xfs_file_buffered_write() -> generic_write_sync(), we need a way to
propagate IOMAP_DIO_WRITE_THROUGH to the higher level so that we can
skip the sync.
Also, a naive question: usually DSYNC means that by the time the
syscall returns we either know the data has reached the medium or we
get an error. Even in aio context I think we respect this semantic
currently. However, with our idea of making DSYNC buffered aio also
truly async via writethrough, won't we be violating this guarantee?
Regards,
Ojaswin
>
> > @@ -765,35 +771,47 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > }
> >
> > /*
> > - * i_size updates must to happen from process context.
> > +	 * i_size updates must happen from process context. For
> > +	 * buffered writethrough, caller might have already changed the
> > + * i_size but still needs endio i_size handling. We can't detect
> > + * this here so just use process context unconditionally.
> > */
> > - if (iomi.pos + iomi.len > dio->i_size)
> > + if ((iomi.pos + iomi.len > dio->i_size) ||
> > + dio_flags & IOMAP_DIO_BUF_WRITETHROUGH)
> > dio->flags |= IOMAP_DIO_COMP_WORK;
>
> This is only true because you called i_size_write() in
> iomap_writethrough_iter() before calling down into the DIO code.
> We should only update i_size on completion of the write-through IO,
> not before we've submitted the IO.
>
> i.e. i_size_write() should only be called by the IO completion that
> drops the wt-ctx in-flight counter to zero. i.e. i_size should not
> change until the entire IO is complete, it should not be updated
> after each folio has the data copied into it.
>
> > /*
> > * Try to invalidate cache pages for the range we are writing.
> > * If this invalidation fails, let the caller fall back to
> > * buffered I/O.
> > + *
> > +	 * The exception is if we are using the dio path for buffered
> > +	 * RWF_WRITETHROUGH, in which case we cannot invalidate the pages
> > +	 * as we are writing them through and already hold their
> > +	 * folio_lock. For the same reason, disable end-of-write invalidation.
> > */
> > - ret = kiocb_invalidate_pages(iocb, iomi.len);
> > - if (ret) {
> > - if (ret != -EAGAIN) {
> > - trace_iomap_dio_invalidate_fail(inode, iomi.pos,
> > - iomi.len);
> > - if (iocb->ki_flags & IOCB_ATOMIC) {
> > - /*
> > - * folio invalidation failed, maybe
> > - * this is transient, unlock and see if
> > - * the caller tries again.
> > - */
> > - ret = -EAGAIN;
> > - } else {
> > - /* fall back to buffered write */
> > - ret = -ENOTBLK;
> > + if (!(dio_flags & IOMAP_DIO_BUF_WRITETHROUGH)) {
> > + ret = kiocb_invalidate_pages(iocb, iomi.len);
> > + if (ret) {
> > + if (ret != -EAGAIN) {
> > + trace_iomap_dio_invalidate_fail(inode, iomi.pos,
> > + iomi.len);
> > + if (iocb->ki_flags & IOCB_ATOMIC) {
> > + /*
> > + * folio invalidation failed, maybe
> > + * this is transient, unlock and see if
> > + * the caller tries again.
> > + */
> > + ret = -EAGAIN;
> > + } else {
> > + /* fall back to buffered write */
> > + ret = -ENOTBLK;
> > + }
> > }
> > + goto out_free_dio;
> > }
> > - goto out_free_dio;
> > - }
> > + } else
> > + dio->flags |= IOMAP_DIO_NO_INVALIDATE;
> > }
>
> Waaaay too much indent. It is time to start factoring
> __iomap_dio_rw() - it is turning into spaghetti with all these
> conditional behaviours.
>
> >
> > if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) {
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 8b3dd145b25e..ca291957140e 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -346,6 +346,7 @@ struct readahead_control;
> > #define IOCB_ATOMIC (__force int) RWF_ATOMIC
> > #define IOCB_DONTCACHE (__force int) RWF_DONTCACHE
> > #define IOCB_NOSIGNAL (__force int) RWF_NOSIGNAL
> > +#define IOCB_WRITETHROUGH (__force int) RWF_WRITETHROUGH
> >
> > /* non-RWF related bits - start at 16 */
> > #define IOCB_EVENTFD (1 << 16)
> > @@ -1985,6 +1986,8 @@ struct file_operations {
> > #define FOP_ASYNC_LOCK ((__force fop_flags_t)(1 << 6))
> > /* File system supports uncached read/write buffered IO */
> > #define FOP_DONTCACHE ((__force fop_flags_t)(1 << 7))
> > +/* File system supports write through buffered IO */
> > +#define FOP_WRITETHROUGH ((__force fop_flags_t)(1 << 8))
> >
> > /* Wrap a directory iterator that needs exclusive inode access */
> > int wrap_directory_iterator(struct file *, struct dir_context *,
> > @@ -3436,6 +3439,10 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
> > if (IS_DAX(ki->ki_filp->f_mapping->host))
> > return -EOPNOTSUPP;
> > }
> > + if (flags & RWF_WRITETHROUGH)
> > + /* file system must support it */
> > + if (!(ki->ki_filp->f_op->fop_flags & FOP_WRITETHROUGH))
> > + return -EOPNOTSUPP;
>
> Needs {}
>
> > kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
> > if (flags & RWF_SYNC)
> > kiocb_flags |= IOCB_DSYNC;
> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index 531f9ebdeeae..b96574bb2918 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -209,6 +209,7 @@ struct iomap_write_ops {
> > #endif /* CONFIG_FS_DAX */
> > #define IOMAP_ATOMIC (1 << 9) /* torn-write protection */
> > #define IOMAP_DONTCACHE (1 << 10)
> > +#define IOMAP_WRITETHROUGH (1 << 11)
> >
> > struct iomap_ops {
> > /*
> > @@ -475,6 +476,15 @@ struct iomap_writepage_ctx {
> > void *wb_ctx; /* pending writeback context */
> > };
> >
> > +struct iomap_writethrough_ctx {
> > + struct kiocb iocb;
> > + struct folio *folio;
> > + struct inode *inode;
> > + struct bio_vec *bvec;
> > + loff_t orig_pos;
> > + loff_t orig_len;
> > +};
> > +
> > struct iomap_ioend *iomap_init_ioend(struct inode *inode, struct bio *bio,
> > loff_t file_offset, u16 ioend_flags);
> > struct iomap_ioend *iomap_split_ioend(struct iomap_ioend *ioend,
> > @@ -590,6 +600,14 @@ struct iomap_dio_ops {
> > */
> > #define IOMAP_DIO_BOUNCE (1 << 4)
> >
> > +/*
> > + * Set when we are using the dio path to perform writethrough for
> > + * RWF_WRITETHROUGH buffered write. The ->endio handler must check this
> > + * to perform any writethrough related cleanup like ending writeback on
> > + * a folio.
> > + */
> > +#define IOMAP_DIO_BUF_WRITETHROUGH (1 << 5)
>
> I suspect that iomap should provide the dio ->endio handler itself
> to run the higher level buffered IO completion handling. i.e. we
> have callbacks for custom endio handling, I'm not sure that we need
> logic flags for that...
Got it Dave, will look into this. Thanks for the review.
Regards,
Ojaswin
>
> -Dave.
> --
> Dave Chinner
> dgc@kernel.org
* Re: [RFC 1/3] iomap: Support buffered RWF_WRITETHROUGH via async dio backend
2026-03-11 10:35 ` Ojaswin Mujoo
@ 2026-03-11 12:05 ` Dave Chinner
2026-03-13 7:43 ` Ojaswin Mujoo
0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2026-03-11 12:05 UTC (permalink / raw)
To: Ojaswin Mujoo
Cc: linux-xfs, linux-fsdevel, djwong, john.g.garry, willy, hch,
ritesh.list, jack, Luis Chamberlain, tytso, p.raghav, andres,
linux-kernel
On Wed, Mar 11, 2026 at 04:05:29PM +0530, Ojaswin Mujoo wrote:
> On Tue, Mar 10, 2026 at 05:48:12PM +1100, Dave Chinner wrote:
> > On Mon, Mar 09, 2026 at 11:04:31PM +0530, Ojaswin Mujoo wrote:
> > This is not what I envisaged write-through using DIO to look like.
> > This is a DIO per folio, rather than a DIO per write() syscall. We
> > want the latter to be the common case, not the former, especially
> > when it comes to RWF_ATOMIC support.
> >
> > i.e. I was expecting something more like having a wt context
> > allocated up front with an appropriately sized bvec appended to it
> > (i.e. single allocation for the common case). Then in
> > iomap_write_end(), we'd mark the folio as under writeback and add it
> > to the bvec. Then we iterate through the IO range adding folio after
> > folio to the bvec.
> >
> > When the bvec is full or we reach the end of the IO, we then push
> > that bvec down to the DIO code. Ideally we'd also push the iomap we
> > already hold down as well, so that the DIO code does not need to
> > look it up again (unless the mapping is stale). The DIO completion
> > callback then runs a completion callback that iterates the folios
> > attached to the bvec and runs buffered writeback completion on them.
> > It can then decrement the wt-ctx IO-in-flight counter.
> >
> > If there is more user data to submit, we keep going around (with a
> > new bvec if we need it) adding folios and submitting them to the dio
> > code until there is no more data to copy in and submit.
> >
> > The writethrough context then drops its own "in-flight" reference
> > and waits for the in-flight counter to go to zero.
>
> Hi Dave,
>
> Thanks for the review. IIUC you are suggesting a per-iomap submission of
> dio rather than per-folio,
Yes, this is the original architectural premise of iomap: we map the
extent first, then iterate over folios, then submit a single bio for
the extent...
> and for each iomap we submit we can
> maintain a per writethrough counter which helps us perform any sort of
> endio cleanup work. I can give this design a try in v2.
Yes, this is exactly how iomap DIO completion tracking works for
IO that requires multiple bios to be submitted. i.e. completion
processing only runs once all IOs -and submission- have completed.
> > > index c24d94349ca5..f4d8ff08a83a 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -713,7 +713,8 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > dio->i_size = i_size_read(inode);
> > > dio->dops = dops;
> > > dio->error = 0;
> > > - dio->flags = dio_flags & (IOMAP_DIO_FSBLOCK_ALIGNED | IOMAP_DIO_BOUNCE);
> > > + dio->flags = dio_flags & (IOMAP_DIO_FSBLOCK_ALIGNED | IOMAP_DIO_BOUNCE |
> > > + IOMAP_DIO_BUF_WRITETHROUGH);
> > > dio->done_before = done_before;
> > >
> > > dio->submit.iter = iter;
> > > @@ -747,8 +748,13 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > if (iocb->ki_flags & IOCB_ATOMIC)
> > > iomi.flags |= IOMAP_ATOMIC;
> > >
> > > - /* for data sync or sync, we need sync completion processing */
> > > - if (iocb_is_dsync(iocb)) {
> > > + /*
> > > + * for data sync or sync, we need sync completion processing.
> > > + * for buffered writethrough, sync is handled in buffered IO
> > > + * path so not needed here
> > > + */
> > > + if (iocb_is_dsync(iocb) &&
> > > + !(dio->flags & IOMAP_DIO_BUF_WRITETHROUGH)) {
> > > dio->flags |= IOMAP_DIO_NEED_SYNC;
> >
> > Ah, that looks wrong. We want writethrough to be able to use FUA
> > optimisations for RWF_DSYNC. This prevents the DIO write for wt from
> > setting IOMAP_DIO_WRITE_THROUGH which is needed to trigger FUA
> > writes for RWF_DSYNC ops.
> >
> > i.e. we need DIO to handle the write completions directly to allow
> > conditional calling of generic_write_sync() based on whether FUA
> > writes were used or not.
>
> Yes right, for now we just let xfs_file_buffered_write() ->
> generic_write_sync() to handle the sync because first we wanted to have
> some discussion on how to correctly implement optimized
> O_DSYNC/RWF_DSYNC.
Ah, I had assumed that discussion was largely unnecessary because it
was obvious to me how to implement writethrough behaviour. i.e. all
we need to do is replicate the iomap DIO internal
submission/completion model for wt around the outside of the async
DIO write submission loop, and we are largely done.
> Some open questions that I have right now:
>
> 1. For non-aio non FUA writethrough, where is the right place to do the
> sync?
At the end of final wt ctx IO completion, just like we do for DIO.
> We can't simply rely on iomap_dio_complete() to do the sync
> since we still hold the writeback bit and that causes a deadlock.
Right, you do it at wt ctx IO completion after all the folios in the
range have been marked clean. At that point, all that remains is for
the device cache to be flushed and the metadata sync operations to
be performed.
i.e. This is exactly the same integrity requirement as non-FUA
RWF_DSYNC DIO.
> Even if
> we solve that, we need a way to propagate any fsync errors
> back to the user, so endio might not be the right place anyway?
It's the same model as DIO. If we have async submission and the IO
is not complete, we return -EIOCBQUEUED. Otherwise we gather the
error from the completed wt ctx and return that.
> 2. For non-aio writethrough, if we do want to do the sync via
> xfs_file_buffered_write() -> generic_write_sync(),
We don't want to do that. This crappy "caller submits and waits for
IO" model is the primary reason we can't do async buffered
RWF_DSYNC.
The sync operation needs to be run at completion processing. If we
are not doing AIO, then the submitter waits for all submitted IOs to
complete, then runs completion processing itself. IO errors are
collected directly and returned to the submitter.
If we are doing AIO, then the submitter drops its IO reference, and
then the final IO completion that runs will execute the sync
operation, and the result is returned to the AIO completion ring
via the iocb->ki_complete() callback.
This is exactly the same model as the iomap DIO code - it's lifted
up a layer to the buffered WT layer and wrapped around multiple
async DIO submissions...
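i.e. something like the following pseudocode sketch (all names here are
illustrative, not a real patch):

```c
/*
 * Pseudocode sketch of the completion model: the dsync work runs
 * once, at final wt-ctx completion, exactly as iomap DIO does it.
 */
static void iomap_wt_complete(struct iomap_wt_ctx *ctx)
{
	struct kiocb *iocb = ctx->iocb;
	ssize_t ret = ctx->error ?: ctx->transferred;

	/* folios are clean by now, so no writeback deadlock */
	if (ret > 0 && iocb_is_dsync(iocb) && !ctx->used_fua)
		ret = generic_write_sync(iocb, ret);

	if (is_sync_kiocb(iocb))
		ctx->ret = ret;			/* submitter is waiting */
	else
		iocb->ki_complete(iocb, ret);	/* AIO completion ring */
}

/* called as each per-extent DIO finishes */
static void iomap_wt_end_io(struct iomap_wt_ctx *ctx, int error)
{
	if (error)
		cmpxchg(&ctx->error, 0, error);
	if (atomic_dec_and_test(&ctx->in_flight))
		iomap_wt_complete(ctx);
}
```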
> we need a way to
> propagate IOMAP_DIO_WRITE_THROUGH to the higher level so that we can
> skip the sync.
The sync disappears completely from the higher layers - to take
advantage of FUA optimisations, the sync operations need to be
handled by the WT code. i.e. Buffered DSYNC or OSYNC writes are
-always- write-through operations after this infrastructure is put
in place, they are never run by high level code like
xfs_file_buffered_write().
Indeed, do you see generic_write_sync() calls in the XFS DIO write
paths?
> Also, a naive question, usually DSYNC means that by the time the
> syscall returns we'd either know data has reached the medium or we will
> get an error. Even in aio context I think we respect this semantic
> currently.
For a write() style syscall, yes. For AIO/io_uring, no.
io_submit() only returns an error if there is something wrong
with the aio ctx or iocbs being submitted. It does not report
completion status of the iocbs that are submitted. You need to call
io_getevents() to obtain the completion status of individual iocbs
that have been submitted via io_submit().
Think about it: if you submit 16 IOs in one io_submit() call and
one fails, how do you find out which IO failed?
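As an illustrative userspace fragment (using libaio, link with -laio,
error handling trimmed) -- per-iocb status only surfaces through
io_getevents(), never through io_submit():

```c
/* Illustrative only. io_submit() succeeding says nothing about
 * individual IO outcomes; each iocb's result comes back in its
 * io_event. */
#include <stdio.h>
#include <libaio.h>

int submit_and_reap(io_context_t ctx, struct iocb **iocbs, int nr)
{
	struct io_event events[16];
	int i, ret;

	/* fails only for a bad ctx or malformed iocbs */
	ret = io_submit(ctx, nr, iocbs);
	if (ret < 0)
		return ret;

	/* completion status (e.g. -EIO) arrives per event */
	ret = io_getevents(ctx, nr, 16, events, NULL);
	for (i = 0; i < ret; i++)
		if ((long)events[i].res < 0)
			fprintf(stderr, "iocb %p failed: %ld\n",
				(void *)events[i].obj,
				(long)events[i].res);
	return ret;
}
```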
> However, with our idea of making the DSYNC buffered aio also
> truly async, via writethrough, won't we be violating this guarantee?
No, the error will be returned to the AIO completion ring, same as
it is now.
-Dave.
--
Dave Chinner
dgc@kernel.org
* Re: [RFC 1/3] iomap: Support buffered RWF_WRITETHROUGH via async dio backend
2026-03-11 12:05 ` Dave Chinner
@ 2026-03-13 7:43 ` Ojaswin Mujoo
0 siblings, 0 replies; 11+ messages in thread
From: Ojaswin Mujoo @ 2026-03-13 7:43 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-xfs, linux-fsdevel, djwong, john.g.garry, willy, hch,
ritesh.list, jack, Luis Chamberlain, tytso, p.raghav, andres,
linux-kernel
On Wed, Mar 11, 2026 at 11:05:05PM +1100, Dave Chinner wrote:
> On Wed, Mar 11, 2026 at 04:05:29PM +0530, Ojaswin Mujoo wrote:
> > On Tue, Mar 10, 2026 at 05:48:12PM +1100, Dave Chinner wrote:
> > > On Mon, Mar 09, 2026 at 11:04:31PM +0530, Ojaswin Mujoo wrote:
> > > This is not what I envisaged write-through using DIO to look like.
> > > This is a DIO per folio, rather than a DIO per write() syscall. We
> > > want the latter to be the common case, not the former, especially
> > > when it comes to RWF_ATOMIC support.
> > >
> > > i.e. I was expecting something more like having a wt context
> > > allocated up front with an appropriately sized bvec appended to it
> > > (i.e. single allocation for the common case). Then in
> > > iomap_write_end(), we'd mark the folio as under writeback and add it
> > > to the bvec. Then we iterate through the IO range adding folio after
> > > folio to the bvec.
> > >
> > > When the bvec is full or we reach the end of the IO, we then push
> > > that bvec down to the DIO code. Ideally we'd also push the iomap we
> > > already hold down as well, so that the DIO code does not need to
> > > look it up again (unless the mapping is stale). The DIO completion
> > > callback then runs a completion callback that iterates the folios
> > > attached to the bvec and runs buffered writeback completion on them.
> > > It can then decrement the wt-ctx IO-in-flight counter.
> > >
> > > If there is more user data to submit, we keep going around (with a
> > > new bvec if we need it) adding folios and submitting them to the dio
> > > code until there is no more data to copy in and submit.
> > >
> > > The writethrough context then drops its own "in-flight" reference
> > > and waits for the in-flight counter to go to zero.
> >
> > Hi Dave,
> >
> > Thanks for the review. IIUC you are suggesting a per-iomap submission of
> > dio rather than per-folio,
>
> Yes, this is the original architectural premise of iomap: we map the
> extent first, then iterate over folios, then submit a single bio for
> the extent...
>
> > and for each iomap we submit we can
> > maintain a per-writethrough counter that helps us perform any
> > endio cleanup work. I can give this design a try in v2.
>
> Yes, this is exactly how iomap DIO completion tracking works for
> IO that requires multiple bios to be submitted. i.e. completion
> processing only runs once all IOs -and submission- have completed.
>
> > > > index c24d94349ca5..f4d8ff08a83a 100644
> > > > --- a/fs/iomap/direct-io.c
> > > > +++ b/fs/iomap/direct-io.c
> > > > @@ -713,7 +713,8 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > > dio->i_size = i_size_read(inode);
<...>
> > currently.
>
> For a write() style syscall, yes. For AIO/io_uring, no.
>
> io_submit() only returns an error if there is something wrong
> with the aio ctx or iocbs being submitted. It does not report
> completion status of the iocbs that are submitted. You need to call
> io_getevents() to obtain the completion status of individual iocbs
> that have been submitted via io_submit().
>
> Think about it: if you submit 16 IOs in one io_submit() call and
> one fails, how do you find out which IO failed?
>
> > However, with our idea of making the DSYNC buffered aio also
> > truly async, via writethrough, won't we be violating this guarantee?
>
> No, the error will be returned to the AIO completion ring, same as
> it is now.
Thanks for the pointers Dave, I now have a decent picture of how
O_DSYNC/RWF_DSYNC IO will look like with writethrough. I'll try to
incorporate this in the next version, along with your other suggestions.
Regards,
ojaswin
>
> -Dave.
> --
> Dave Chinner
> dgc@kernel.org
Thread overview: 11+ messages
2026-03-09 17:34 [RFC 0/3] Add buffered write-through support to iomap & xfs Ojaswin Mujoo
2026-03-09 17:34 ` [RFC 1/3] iomap: Support buffered RWF_WRITETHROUGH via async dio backend Ojaswin Mujoo
2026-03-10 6:48 ` Dave Chinner
2026-03-11 10:35 ` Ojaswin Mujoo
2026-03-11 12:05 ` Dave Chinner
2026-03-13 7:43 ` Ojaswin Mujoo
2026-03-09 17:34 ` [RFC 2/3] iomap: Enable stable writes for RWF_WRITETHROUGH inodes Ojaswin Mujoo
2026-03-10 3:57 ` Darrick J. Wong
2026-03-10 5:25 ` Ritesh Harjani
2026-03-11 6:27 ` Ojaswin Mujoo
2026-03-09 17:34 ` [RFC 3/3] xfs: Add RWF_WRITETHROUGH support to xfs Ojaswin Mujoo