* [RFC PATCH v2 0/5] Add buffered write-through support to iomap & xfs
@ 2026-04-08 18:45 Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 1/5] mm: Refactor folio_clear_dirty_for_io() Ojaswin Mujoo
` (4 more replies)
0 siblings, 5 replies; 6+ messages in thread
From: Ojaswin Mujoo @ 2026-04-08 18:45 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: djwong, john.g.garry, willy, hch, ritesh.list, jack,
Luis Chamberlain, dgc, tytso, p.raghav, andres, brauner,
linux-kernel, linux-mm
Hi all,
This is the v2 RFC to add buffered writethrough support to iomap and
xfs. The changes are mostly aimed at bringing the writethrough
implementation more in line with how dio handles writes.
** Changes since RFC v1 [3] **
1. In v1, even the non-aio writethrough syscall returned after IO submission
but before waiting for the IO to finish. However, upon revisiting some of the
discussions, we feel it's cleaner to keep the behavior similar to dio,
i.e. the non-aio variant should only return after the IO completes and
report any issues upon return. Hence v2 now follows the exact pattern of dio,
where non-aio writethrough waits for the write to finish whereas aio
writethrough returns after submission. This is in line with the discussion
here [2].
2. Instead of submitting a bio per folio, we now submit a bio per iomap.
Only once all IOs are complete do we call the completion function to invoke
the FS specific ->end_io().
3. Instead of reusing dio code, we have open coded the IO submission and
completion. Although this is heavily inspired by dio, trying to shoehorn
buffered writethrough handling into iomap_dio_rw() was resulting in ugly
if-elses and hard to follow code. The open coded variant is cleaner and
easier to follow; however, ideally we should try to factor out common
parts of the dio code to have a cleaner interface.
4. Support for aio and DSYNC writethrough is added, which utilizes FUA
optimizations if available.
5. Added a new ->writethrough_submit() operation which allows FSes to
perform tasks before IO submission, like converting COW mappings to written.
The motivation is explained in patch 3.
6. Refactored folio_clear_dirty_for_io() so it can be reused without
having to call folio_mkclean(). This is because writethrough mkcleans
the folio in all cases but only clears the dirty bit if the whole folio
is about to become clean.
[2] https://lore.kernel.org/all/aZUQKx_C3-qyU4PJ@dread/
[3] https://lore.kernel.org/linux-xfs/cover.1773076216.git.ojaswin@linux.ibm.com/
*** Original Cover ***
Hi all,
This patchset implements an early design prototype of buffered I/O
write-through semantics in Linux.
This idea mainly picked up traction to enable RWF_ATOMIC buffered IO [1];
however, the write-through path can have many use cases beyond atomic writes,
such as:
- enabling truly async AIO buffered I/O when issued with O_DSYNC
- better scalability for buffered I/O
The implementation of write-through combines the buffered IO frontend
with dio backend, which leads to some interesting interactions.
I've added most of the design notes in the respective patches. Please note
that this is an initial RFC to iron out any early design issues. This is
largely based on suggestions from Dave and Jan in [1], so thanks for the
pointers!
* Testing Notes (UPDATED) *
- I've added support for RWF_WRITETHROUGH to fsx and fsstress in
xfstests and these patches survive fsx with integrity verification as
well as fsstress parallel stressing.
- -g quick with block size == page size and block size < page size shows
no new regressions.
* Design TODOs (UPDATED) *
- Evaluate if we need to tag page cache dirty bit in xarray, since
PG_Writeback is already set on the folio.
- Look into a better way to refactor writethrough path by reusing common
parts of dio code.
* Future work (once design is finalized) (UPDATED) *
- Add RWF_ATOMIC support for buffered IO via write-through path
- Add support of other RWF_ flags for write-through buffered I/O path
- Benchmarking numbers and more thorough testing needed.
- ext4 support for writethrough
- Utilize writethrough for the normal buffered DSYNC path to get truly async
semantics for DSYNC
- Look into folio batching support.
As usual, thoughts and suggestions are welcome.
[1] https://lore.kernel.org/all/d0c4d95b-8064-4a7e-996d-7ad40eb4976b@linux.dev/
Regards,
ojaswin
Ojaswin Mujoo (5):
mm: Refactor folio_clear_dirty_for_io()
iomap: Add initial support for buffered RWF_WRITETHROUGH
xfs: Add RWF_WRITETHROUGH support to xfs
iomap: Add aio support to RWF_WRITETHROUGH
iomap: Add DSYNC support to writethrough
fs/iomap/buffered-io.c | 420 ++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_file.c | 53 ++++-
include/linux/fs.h | 7 +
include/linux/iomap.h | 45 +++++
include/linux/pagemap.h | 1 +
include/uapi/linux/fs.h | 5 +-
mm/page-writeback.c | 18 +-
7 files changed, 540 insertions(+), 9 deletions(-)
--
2.53.0
^ permalink raw reply [flat|nested] 6+ messages in thread
* [RFC PATCH v2 1/5] mm: Refactor folio_clear_dirty_for_io()
2026-04-08 18:45 [RFC PATCH v2 0/5] Add buffered write-through support to iomap & xfs Ojaswin Mujoo
@ 2026-04-08 18:45 ` Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 2/5] iomap: Add initial support for buffered RWF_WRITETHROUGH Ojaswin Mujoo
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Ojaswin Mujoo @ 2026-04-08 18:45 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: djwong, john.g.garry, willy, hch, ritesh.list, jack,
Luis Chamberlain, dgc, tytso, p.raghav, andres, brauner,
linux-kernel, linux-mm
Add a new __folio_clear_dirty_for_io() helper which takes an extra
parameter to indicate folio_mkclean() is needed. This is in preparation
of buffered writethrough support where we already do folio_mkclean()
before calling into this function.
Co-developed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
mm/page-writeback.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 601a5e048d12..2f0c6916213d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2847,8 +2847,11 @@ EXPORT_SYMBOL(__folio_cancel_dirty);
*
* This incoherency between the folio's dirty flag and xarray tag is
* unfortunate, but it only exists while the folio is locked.
+ *
+ * For some cases we might not want to do mkclean, e.g. if we've already taken
+ * care of it, hence pass the should_mkclean flag to indicate if it's needed.
*/
-bool folio_clear_dirty_for_io(struct folio *folio)
+static bool __folio_clear_dirty_for_io(struct folio *folio, bool should_mkclean)
{
struct address_space *mapping = folio_mapping(folio);
bool ret = false;
@@ -2885,7 +2888,7 @@ bool folio_clear_dirty_for_io(struct folio *folio)
* as a serialization point for all the different
* threads doing their things.
*/
- if (folio_mkclean(folio))
+ if (should_mkclean && folio_mkclean(folio))
folio_mark_dirty(folio);
/*
* We carefully synchronise fault handlers against
@@ -2908,6 +2911,11 @@ bool folio_clear_dirty_for_io(struct folio *folio)
}
return folio_test_clear_dirty(folio);
}
+
+bool folio_clear_dirty_for_io(struct folio *folio)
+{
+ return __folio_clear_dirty_for_io(folio, true);
+}
EXPORT_SYMBOL(folio_clear_dirty_for_io);
static void wb_inode_writeback_start(struct bdi_writeback *wb)
--
2.53.0
* [RFC PATCH v2 2/5] iomap: Add initial support for buffered RWF_WRITETHROUGH
2026-04-08 18:45 [RFC PATCH v2 0/5] Add buffered write-through support to iomap & xfs Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 1/5] mm: Refactor folio_clear_dirty_for_io() Ojaswin Mujoo
@ 2026-04-08 18:45 ` Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 3/5] xfs: Add RWF_WRITETHROUGH support to xfs Ojaswin Mujoo
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Ojaswin Mujoo @ 2026-04-08 18:45 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: djwong, john.g.garry, willy, hch, ritesh.list, jack,
Luis Chamberlain, dgc, tytso, p.raghav, andres, brauner,
linux-kernel, linux-mm
This adds initial support for performing a buffered non-aio
RWF_WRITETHROUGH write. The rough flow for a writethrough write is as
follows:
1. Acquire inode lock
2. initialize writethrough context (wt_ctx) and mark
mapping as stable.
3. Start the iomap_iter() loop. For each iomap:
3.1. Acquire folio and folio_lock.
3.2. perform memcpy from user buffer to the folio and mark it
dirty
3.3. Wait for any current writeback to complete and then call
folio_mkclean() to prevent mmap writes from changing it.
3.4. Start writeback on the folio
3.5. Add the folio range under write to wt_ctx->bvec and folio_unlock()
3.6. If bvec is full, submit the current bvecs for IO.
3.7. Repeat 3.2 to 3.6 till the whole iomap is processed. Submit
the final set of bvecs for IO.
4. Repeat step 3 till we have no more data to write.
5. Finally, sleep in the syscall thread till all the IOs are
completed (refcount == 0). Once that happens, the end io handler will
wake us up.
6. Upon waking up, call fs ->end_io() callback (which updates inode
size), record any errors and return.
7. inode_unlock()
This design gives buffered writethrough the same semantics as dio, and
any error in the IO is directly returned to the caller. The design
deliberately open codes the IO submission and completion flow (inspired
by dio) rather than reusing the dio functions, as accommodating buffered
writethrough logic in the dio code was polluting it with too many if-else
conditionals and special cases.
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Dave Chinner <dgc@kernel.org>
Co-developed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
fs/iomap/buffered-io.c | 352 ++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 7 +
include/linux/iomap.h | 38 +++++
include/linux/pagemap.h | 1 +
include/uapi/linux/fs.h | 5 +-
mm/page-writeback.c | 6 +
6 files changed, 408 insertions(+), 1 deletion(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e4b6886e5c3c..74e1ab108b0f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -9,6 +9,7 @@
#include <linux/swap.h>
#include <linux/migrate.h>
#include <linux/fserror.h>
+#include <linux/rmap.h>
#include "internal.h"
#include "trace.h"
@@ -1096,6 +1097,276 @@ static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
return __iomap_write_end(iter->inode, pos, len, copied, folio);
}
+static ssize_t iomap_writethrough_complete(struct iomap_writethrough_ctx *wt_ctx)
+{
+ struct kiocb *iocb = wt_ctx->iocb;
+ struct inode *inode = wt_ctx->inode;
+ ssize_t ret = wt_ctx->error;
+
+ if (wt_ctx->dops && wt_ctx->dops->end_io) {
+ int err = wt_ctx->dops->end_io(iocb, wt_ctx->written,
+ wt_ctx->error,
+ wt_ctx->flags);
+ if (err)
+ ret = err;
+ }
+
+ mapping_clear_stable_writes(inode->i_mapping);
+
+ if (!ret) {
+ ret = wt_ctx->written;
+ iocb->ki_pos = wt_ctx->pos + ret;
+ }
+
+ kfree(wt_ctx);
+ return ret;
+}
+
+static void iomap_writethrough_done(struct iomap_writethrough_ctx *wt_ctx)
+{
+ struct task_struct *waiter = wt_ctx->waiter;
+
+ WRITE_ONCE(wt_ctx->waiter, NULL);
+ blk_wake_io_task(waiter);
+ return;
+}
+
+static void iomap_writethrough_bio_end_io(struct bio *bio)
+{
+ struct iomap_writethrough_ctx *wt_ctx = bio->bi_private;
+ struct folio_iter fi;
+
+ if (bio->bi_status)
+ cmpxchg(&wt_ctx->error, 0,
+ blk_status_to_errno(bio->bi_status));
+ bio_for_each_folio_all(fi, bio)
+ folio_end_writeback(fi.folio);
+
+ bio_put(bio);
+ if (atomic_dec_and_test(&wt_ctx->ref))
+ iomap_writethrough_done(wt_ctx);
+}
+
+static void
+iomap_writethrough_submit_bio(struct iomap_writethrough_ctx *wt_ctx,
+ struct iomap *iomap,
+ const struct iomap_writethrough_ops *wt_ops)
+{
+ struct bio *bio;
+ unsigned int i;
+ u64 len = 0;
+
+ if (!wt_ctx->nr_bvecs)
+ return;
+
+ for (i = 0; i < wt_ctx->nr_bvecs; i++)
+ len += wt_ctx->bvec[i].bv_len;
+
+ if (wt_ops->writethrough_submit)
+ wt_ops->writethrough_submit(wt_ctx->inode, iomap, wt_ctx->bio_pos,
+ len);
+
+ bio = bio_alloc(iomap->bdev, wt_ctx->nr_bvecs, REQ_OP_WRITE, GFP_NOFS);
+ bio->bi_iter.bi_sector = iomap_sector(iomap, wt_ctx->bio_pos);
+ bio->bi_end_io = iomap_writethrough_bio_end_io;
+ bio->bi_private = wt_ctx;
+
+ for (i = 0; i < wt_ctx->nr_bvecs; i++)
+ __bio_add_page(bio, wt_ctx->bvec[i].bv_page,
+ wt_ctx->bvec[i].bv_len,
+ wt_ctx->bvec[i].bv_offset);
+
+ atomic_inc(&wt_ctx->ref);
+ submit_bio(bio);
+ wt_ctx->nr_bvecs = 0;
+}
+
+/**
+ * iomap_folio_prepare_writethrough - prepare a folio for writethrough
+ * @folio: folio to prepare for writethrough
+ * @off: offset of the write within the folio
+ * @len: length of the write within the folio
+ *
+ * This function does the major preparation work needed before starting the
+ * writethrough. The main task is to prepare the folio for writethrough by
+ * blocking mmap writes and setting writeback on it. Further, we must clear
+ * the write range to non-dirty. If this results in the complete folio
+ * becoming non-dirty, then we need to clear the master dirty bit.
+ */
+static void iomap_folio_prepare_writethrough(struct folio *folio, size_t off,
+ size_t len)
+{
+ bool fully_written;
+ u64 zero = 0;
+
+ if (folio_test_writeback(folio))
+ folio_wait_writeback(folio);
+
+ if (folio_mkclean(folio))
+ folio_mark_dirty(folio);
+
+ /*
+ * We might either write through the complete folio or a partial folio
+ * writethrough might result in all blocks becoming non-dirty, so we need to
+ * check and mark the folio clean if that is the case.
+ */
+ fully_written = (off == 0 && len == folio_size(folio));
+ iomap_clear_range_dirty(folio, off, len);
+ if (fully_written ||
+ !iomap_find_dirty_range(folio, &zero, folio_size(folio)))
+ folio_clear_dirty_for_writethrough(folio);
+
+ folio_start_writeback(folio);
+}
+
+/**
+ * iomap_writethrough_iter - perform RWF_WRITETHROUGH buffered write
+ * @wt_ctx: writethrough context
+ * @iter: iomap iter holding mapping information
+ * @i: iov_iter for write
+ * @wt_ops: the fs callbacks needed for writethrough
+ *
+ * This function copies the user buffer to folio similar to usual buffered
+ * IO path, with the difference that we immediately issue the IO. For this we
+ * utilize IO submission and completion mechanism that is inspired by dio.
+ *
+ * Folio handling note: We might be writing through a partial folio so we need
+ * to be careful to not clear the folio dirty bit unless there are no dirty blocks
+ * in the folio after the writethrough.
+ */
+static int iomap_writethrough_iter(struct iomap_writethrough_ctx *wt_ctx,
+ struct iomap_iter *iter, struct iov_iter *i,
+ const struct iomap_writethrough_ops *wt_ops)
+
+{
+ ssize_t total_written = 0;
+ int status = 0;
+ struct address_space *mapping = iter->inode->i_mapping;
+ size_t chunk = mapping_max_folio_size(mapping);
+ unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0;
+ unsigned int bs = i_blocksize(iter->inode);
+
+ /* copied over based on how DIO handles these flags */
+ if (iter->iomap.type == IOMAP_UNWRITTEN)
+ wt_ctx->flags |= IOMAP_DIO_UNWRITTEN;
+ if (iter->iomap.flags & IOMAP_F_SHARED)
+ wt_ctx->flags |= IOMAP_DIO_COW;
+
+ if (!(iter->flags & IOMAP_WRITETHROUGH))
+ return -EINVAL;
+
+ do {
+ struct folio *folio;
+ size_t offset; /* Offset into folio */
+ u64 bytes; /* Bytes to write to folio */
+ size_t copied; /* Bytes copied from user */
+ u64 written; /* Bytes have been written */
+ loff_t pos;
+ size_t off_aligned, len_aligned;
+
+ bytes = iov_iter_count(i);
+retry:
+ offset = iter->pos & (chunk - 1);
+ bytes = min(chunk - offset, bytes);
+ status = balance_dirty_pages_ratelimited_flags(mapping,
+ bdp_flags);
+ if (unlikely(status))
+ break;
+
+ /*
+ * If completions already occurred and reported errors, give up
+ * now and don't bother submitting more bios.
+ */
+ if (unlikely(data_race(wt_ctx->error))) {
+ wt_ctx->nr_bvecs = 0;
+ break;
+ }
+
+ if (bytes > iomap_length(iter))
+ bytes = iomap_length(iter);
+
+ /*
+ * Bring in the user page that we'll copy from _first_.
+ * Otherwise there's a nasty deadlock on copying from the
+ * same page as we're writing to, without it being marked
+ * up-to-date.
+ *
+ * For async buffered writes the assumption is that the user
+ * page has already been faulted in. This can be optimized by
+ * faulting the user page.
+ */
+ if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
+ status = -EFAULT;
+ break;
+ }
+
+ status = iomap_write_begin(iter, wt_ops->write_ops, &folio,
+ &offset, &bytes);
+ if (unlikely(status)) {
+ iomap_write_failed(iter->inode, iter->pos, bytes);
+ break;
+ }
+ if (iter->iomap.flags & IOMAP_F_STALE)
+ break;
+
+ pos = iter->pos;
+
+ if (mapping_writably_mapped(mapping))
+ flush_dcache_folio(folio);
+
+ copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
+ written = iomap_write_end(iter, bytes, copied, folio) ?
+ copied : 0;
+
+ if (!written)
+ goto put_folio;
+
+ off_aligned = round_down(offset, bs);
+ len_aligned = round_up(offset + written, bs) - off_aligned;
+
+ iomap_folio_prepare_writethrough(folio, off_aligned,
+ len_aligned);
+
+ if (!wt_ctx->nr_bvecs)
+ wt_ctx->bio_pos = round_down(pos, bs);
+
+ bvec_set_folio(&wt_ctx->bvec[wt_ctx->nr_bvecs], folio,
+ len_aligned, off_aligned);
+ wt_ctx->nr_bvecs++;
+ wt_ctx->written += written;
+
+ if (pos + written > wt_ctx->new_i_size)
+ wt_ctx->new_i_size = pos + written;
+
+ if (wt_ctx->nr_bvecs == wt_ctx->max_bvecs)
+ iomap_writethrough_submit_bio(wt_ctx, &iter->iomap, wt_ops);
+
+put_folio:
+ __iomap_put_folio(iter, wt_ops->write_ops, written, folio);
+
+ cond_resched();
+ if (unlikely(written == 0)) {
+ iomap_write_failed(iter->inode, pos, bytes);
+ iov_iter_revert(i, copied);
+
+ if (chunk > PAGE_SIZE)
+ chunk /= 2;
+ if (copied) {
+ bytes = copied;
+ goto retry;
+ }
+ } else {
+ total_written += written;
+ iomap_iter_advance(iter, written);
+ }
+ } while (iov_iter_count(i) && iomap_length(iter));
+
+ if (wt_ctx->nr_bvecs)
+ iomap_writethrough_submit_bio(wt_ctx, &iter->iomap, wt_ops);
+
+ return total_written ? 0 : status;
+}
+
static int iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i,
const struct iomap_write_ops *write_ops)
{
@@ -1232,6 +1503,87 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
}
EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
+ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
+ const struct iomap_writethrough_ops *wt_ops,
+ void *private)
+{
+ struct inode *inode = iocb->ki_filp->f_mapping->host;
+ struct iomap_iter iter = {
+ .inode = inode,
+ .pos = iocb->ki_pos,
+ .len = iov_iter_count(i),
+ .flags = IOMAP_WRITE | IOMAP_WRITETHROUGH,
+ .private = private,
+ };
+ struct iomap_writethrough_ctx *wt_ctx;
+ unsigned int max_bvecs;
+ ssize_t ret;
+
+
+ /*
+ * For now we don't support any other flag with WRITETHROUGH
+ */
+ if (!(iocb->ki_flags & IOCB_WRITETHROUGH))
+ return -EINVAL;
+ if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_DONTCACHE))
+ return -EINVAL;
+ if (iocb_is_dsync(iocb))
+ /* D_SYNC support not implemented yet */
+ return -EOPNOTSUPP;
+ if (!is_sync_kiocb(iocb))
+ /* aio support not implemented yet */
+ return -EOPNOTSUPP;
+
+ /*
+ * +1 to max bvecs to account for unaligned write spanning multiple
+ * folios
+ */
+ max_bvecs = DIV_ROUND_UP(
+ iov_iter_count(i),
+ PAGE_SIZE << mapping_min_folio_order(inode->i_mapping)) + 1;
+
+ if (max_bvecs > BIO_MAX_VECS)
+ max_bvecs = BIO_MAX_VECS;
+ if (!max_bvecs)
+ max_bvecs = 1;
+
+ wt_ctx = kzalloc(struct_size(wt_ctx, bvec, max_bvecs), GFP_NOFS);
+ if (!wt_ctx)
+ return -ENOMEM;
+
+ wt_ctx->iocb = iocb;
+ wt_ctx->inode = inode;
+ wt_ctx->dops = wt_ops->dops;
+ wt_ctx->pos = iocb->ki_pos;
+ wt_ctx->new_i_size = i_size_read(inode);
+ wt_ctx->max_bvecs = max_bvecs;
+ atomic_set(&wt_ctx->ref, 1);
+ wt_ctx->waiter = current;
+
+ mapping_set_stable_writes(inode->i_mapping);
+
+ while ((ret = iomap_iter(&iter, wt_ops->ops)) > 0) {
+ WARN_ON(iter.iomap.type != IOMAP_UNWRITTEN &&
+ iter.iomap.type != IOMAP_MAPPED);
+ iter.status = iomap_writethrough_iter(wt_ctx, &iter, i, wt_ops);
+ }
+ if (ret < 0)
+ cmpxchg(&wt_ctx->error, 0, ret);
+
+ if (!atomic_dec_and_test(&wt_ctx->ref)) {
+ for (;;) {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ if (!READ_ONCE(wt_ctx->waiter))
+ break;
+ blk_io_schedule();
+ }
+ __set_current_state(TASK_RUNNING);
+ }
+
+ return iomap_writethrough_complete(wt_ctx);
+}
+EXPORT_SYMBOL_GPL(iomap_file_writethrough_write);
+
static void iomap_write_delalloc_ifs_punch(struct inode *inode,
struct folio *folio, loff_t start_byte, loff_t end_byte,
struct iomap *iomap, iomap_punch_t punch)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 547ce27fb741..2f95fd49472a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -344,6 +344,7 @@ struct readahead_control;
#define IOCB_ATOMIC (__force int) RWF_ATOMIC
#define IOCB_DONTCACHE (__force int) RWF_DONTCACHE
#define IOCB_NOSIGNAL (__force int) RWF_NOSIGNAL
+#define IOCB_WRITETHROUGH (__force int) RWF_WRITETHROUGH
/* non-RWF related bits - start at 16 */
#define IOCB_EVENTFD (1 << 16)
@@ -1985,6 +1986,8 @@ struct file_operations {
#define FOP_ASYNC_LOCK ((__force fop_flags_t)(1 << 6))
/* File system supports uncached read/write buffered IO */
#define FOP_DONTCACHE ((__force fop_flags_t)(1 << 7))
+/* File system supports write through buffered IO */
+#define FOP_WRITETHROUGH ((__force fop_flags_t)(1 << 8))
/* Wrap a directory iterator that needs exclusive inode access */
int wrap_directory_iterator(struct file *, struct dir_context *,
@@ -3434,6 +3437,10 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
if (IS_DAX(ki->ki_filp->f_mapping->host))
return -EOPNOTSUPP;
}
+ if (flags & RWF_WRITETHROUGH)
+ /* file system must support it */
+ if (!(ki->ki_filp->f_op->fop_flags & FOP_WRITETHROUGH))
+ return -EOPNOTSUPP;
kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
if (flags & RWF_SYNC)
kiocb_flags |= IOCB_DSYNC;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 531f9ebdeeae..661233aa009d 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -209,6 +209,7 @@ struct iomap_write_ops {
#endif /* CONFIG_FS_DAX */
#define IOMAP_ATOMIC (1 << 9) /* torn-write protection */
#define IOMAP_DONTCACHE (1 << 10)
+#define IOMAP_WRITETHROUGH (1 << 11)
struct iomap_ops {
/*
@@ -475,6 +476,27 @@ struct iomap_writepage_ctx {
void *wb_ctx; /* pending writeback context */
};
+struct iomap_writethrough_ctx {
+ struct kiocb *iocb;
+ const struct iomap_dio_ops *dops;
+ struct inode *inode;
+ loff_t new_i_size;
+ loff_t pos;
+ size_t written;
+ atomic_t ref;
+ unsigned int flags;
+ int error;
+
+ /* used during submission and for non-aio completion */
+ struct task_struct *waiter;
+
+ loff_t bio_pos;
+ unsigned int nr_bvecs;
+ unsigned int max_bvecs;
+ struct bio_vec bvec[];
+
+};
+
struct iomap_ioend *iomap_init_ioend(struct inode *inode, struct bio *bio,
loff_t file_offset, u16 ioend_flags);
struct iomap_ioend *iomap_split_ioend(struct iomap_ioend *ioend,
@@ -599,6 +621,22 @@ struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
ssize_t iomap_dio_complete(struct iomap_dio *dio);
void iomap_dio_bio_end_io(struct bio *bio);
+/*
+ * In writethrough, we copy user data to folio first and then send the folio
+ * to writeback via dio path. To achieve this, we need callbacks from iomap_ops,
+ * iomap_write_ops and iomap_dio_ops. This struct packs them together.
+ */
+struct iomap_writethrough_ops {
+ const struct iomap_ops *ops;
+ const struct iomap_write_ops *write_ops;
+ const struct iomap_dio_ops *dops;
+ int (*writethrough_submit)(struct inode *inode, struct iomap *iomap,
+ loff_t offset, u64 len);
+};
+ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
+ const struct iomap_writethrough_ops *wt_ops,
+ void *private);
+
#ifdef CONFIG_SWAP
struct file;
struct swap_info_struct;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 31a848485ad9..192a00422bc8 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1260,6 +1260,7 @@ static inline void folio_cancel_dirty(struct folio *folio)
__folio_cancel_dirty(folio);
}
bool folio_clear_dirty_for_io(struct folio *folio);
+bool folio_clear_dirty_for_writethrough(struct folio *folio);
bool clear_page_dirty_for_io(struct page *page);
void folio_invalidate(struct folio *folio, size_t offset, size_t length);
bool noop_dirty_folio(struct address_space *mapping, struct folio *folio);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 70b2b661f42c..dec78041b0cf 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -435,10 +435,13 @@ typedef int __bitwise __kernel_rwf_t;
/* prevent pipe and socket writes from raising SIGPIPE */
#define RWF_NOSIGNAL ((__force __kernel_rwf_t)0x00000100)
+/* buffered IO that is asynchronously written through to disk after write */
+#define RWF_WRITETHROUGH ((__force __kernel_rwf_t)0x00000200)
+
/* mask of flags supported by the kernel */
#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
- RWF_DONTCACHE | RWF_NOSIGNAL)
+ RWF_DONTCACHE | RWF_NOSIGNAL | RWF_WRITETHROUGH)
#define PROCFS_IOCTL_MAGIC 'f'
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2f0c6916213d..20561d3d5eaa 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2918,6 +2918,12 @@ bool folio_clear_dirty_for_io(struct folio *folio)
}
EXPORT_SYMBOL(folio_clear_dirty_for_io);
+bool folio_clear_dirty_for_writethrough(struct folio *folio)
+{
+ return __folio_clear_dirty_for_io(folio, false);
+}
+EXPORT_SYMBOL(folio_clear_dirty_for_writethrough);
+
static void wb_inode_writeback_start(struct bdi_writeback *wb)
{
atomic_inc(&wb->writeback_inodes);
--
2.53.0
* [RFC PATCH v2 3/5] xfs: Add RWF_WRITETHROUGH support to xfs
2026-04-08 18:45 [RFC PATCH v2 0/5] Add buffered write-through support to iomap & xfs Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 1/5] mm: Refactor folio_clear_dirty_for_io() Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 2/5] iomap: Add initial support for buffered RWF_WRITETHROUGH Ojaswin Mujoo
@ 2026-04-08 18:45 ` Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 4/5] iomap: Add aio support to RWF_WRITETHROUGH Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 5/5] iomap: Add DSYNC support to writethrough Ojaswin Mujoo
4 siblings, 0 replies; 6+ messages in thread
From: Ojaswin Mujoo @ 2026-04-08 18:45 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: djwong, john.g.garry, willy, hch, ritesh.list, jack,
Luis Chamberlain, dgc, tytso, p.raghav, andres, brauner,
linux-kernel, linux-mm
Add the boilerplate needed to start supporting RWF_WRITETHROUGH in XFS.
We use the direct write ->iomap_begin() functions to ensure the range
under writethrough always has a real non-delalloc extent. We reuse the
xfs dio end IO function to perform extent conversion and i_size handling
for us.
*Note on the COW extent over DATA hole case*
In case of an unmapped COW extent over a DATA hole
(due to COW preallocations), leave the extent unmapped until we are just
about to send the IO. At that point, use the ->writethrough_submit()
callback to convert the COW extent to written.
We initially tried converting during iomap_begin() time (like dio does)
but that results in a stale data exposure as follows:
1. iomap_begin() - converts the COW extent over a DATA hole to written and
marks IOMAP_F_NEW to handle zeroing.
2. During iomap_write_begin() -> realize the extent is stale and return
without zeroing.
3. iomap_begin() - Again sees the same COW extent, but it's written
this time so we don't mark IOMAP_F_NEW.
4. Since IOMAP_F_NEW is unmarked, we never zero out and hence expose
stale data.
To avoid the above, take the buffered IO approach of converting the
extent just before the IO, when we are sure to have zeroed out the folio.
Co-developed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
fs/xfs/xfs_file.c | 53 +++++++++++++++++++++++++++++++++++++++++------
1 file changed, 47 insertions(+), 6 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 6246f34df9fd..d8436d840476 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -988,6 +988,39 @@ xfs_file_dax_write(
return ret;
}
+static int
+xfs_writethrough_submit(
+ struct inode *inode,
+ struct iomap *iomap,
+ loff_t offset,
+ u64 count)
+{
+ int error = 0;
+ unsigned int nofs_flag;
+
+ /*
+ * Convert CoW extents to regular.
+ *
+ * We are under writethrough context with folio lock possibly held. To
+ * avoid memory allocation deadlocks, set the task-wide nofs context.
+ */
+ if (iomap->flags & IOMAP_F_SHARED) {
+ nofs_flag = memalloc_nofs_save();
+ error = xfs_reflink_convert_cow(XFS_I(inode), offset, count);
+ memalloc_nofs_restore(nofs_flag);
+ }
+
+ return error;
+}
+
+const struct iomap_writethrough_ops xfs_writethrough_ops = {
+ .ops = &xfs_direct_write_iomap_ops,
+ .write_ops = &xfs_iomap_write_ops,
+ .dops = &xfs_dio_write_ops,
+ .writethrough_submit = &xfs_writethrough_submit
+};
+
+
STATIC ssize_t
xfs_file_buffered_write(
struct kiocb *iocb,
@@ -1010,9 +1043,13 @@ xfs_file_buffered_write(
goto out;
trace_xfs_file_buffered_write(iocb, from);
- ret = iomap_file_buffered_write(iocb, from,
- &xfs_buffered_write_iomap_ops, &xfs_iomap_write_ops,
- NULL);
+ if (iocb->ki_flags & IOCB_WRITETHROUGH) {
+ ret = iomap_file_writethrough_write(iocb, from,
+ &xfs_writethrough_ops, NULL);
+ } else
+ ret = iomap_file_buffered_write(iocb, from,
+ &xfs_buffered_write_iomap_ops,
+ &xfs_iomap_write_ops, NULL);
/*
* If we hit a space limit, try to free up some lingering preallocated
@@ -1047,8 +1084,12 @@ xfs_file_buffered_write(
if (ret > 0) {
XFS_STATS_ADD(ip->i_mount, xs_write_bytes, ret);
- /* Handle various SYNC-type writes */
- ret = generic_write_sync(iocb, ret);
+ /*
+ * Handle various SYNC-type writes.
+ * For writethrough, we handle sync during completion.
+ */
+ if (!(iocb->ki_flags & IOCB_WRITETHROUGH))
+ ret = generic_write_sync(iocb, ret);
}
return ret;
}
@@ -2042,7 +2083,7 @@ const struct file_operations xfs_file_operations = {
.remap_file_range = xfs_file_remap_range,
.fop_flags = FOP_MMAP_SYNC | FOP_BUFFER_RASYNC |
FOP_BUFFER_WASYNC | FOP_DIO_PARALLEL_WRITE |
- FOP_DONTCACHE,
+ FOP_DONTCACHE | FOP_WRITETHROUGH,
.setlease = generic_setlease,
};
--
2.53.0
* [RFC PATCH v2 4/5] iomap: Add aio support to RWF_WRITETHROUGH
2026-04-08 18:45 [RFC PATCH v2 0/5] Add buffered write-through support to iomap & xfs Ojaswin Mujoo
` (2 preceding siblings ...)
2026-04-08 18:45 ` [RFC PATCH v2 3/5] xfs: Add RWF_WRITETHROUGH support to xfs Ojaswin Mujoo
@ 2026-04-08 18:45 ` Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 5/5] iomap: Add DSYNC support to writethrough Ojaswin Mujoo
4 siblings, 0 replies; 6+ messages in thread
From: Ojaswin Mujoo @ 2026-04-08 18:45 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: djwong, john.g.garry, willy, hch, ritesh.list, jack,
Luis Chamberlain, dgc, tytso, p.raghav, andres, brauner,
linux-kernel, linux-mm
With aio, the only thing we need to be careful of is that writethrough
can be in progress even after the inode and folio locks are dropped. Due
to this, we need a way to synchronise with other paths where stable
writes are not enough, for example:
1. Truncate to 0 in xfs sets i_size = 0 before waiting for writeback to
complete. In case of writethrough, the end io completion can again
push i_size to a non-zero value.
2. Dio reads might race with the aio writethrough ->end_io() and read 0s
if the unwritten conversion is yet to happen.
Hence use the dio begin/end as it gives us the required guarantees.
Co-developed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
fs/iomap/buffered-io.c | 53 ++++++++++++++++++++++++++++++++++++------
include/linux/iomap.h | 10 ++++++--
2 files changed, 54 insertions(+), 9 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 74e1ab108b0f..6937f10e2782 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1113,6 +1113,9 @@ static ssize_t iomap_writethrough_complete(struct iomap_writethrough_ctx *wt_ctx
mapping_clear_stable_writes(inode->i_mapping);
+ if (wt_ctx->is_aio)
+ inode_dio_end(inode);
+
if (!ret) {
ret = wt_ctx->written;
iocb->ki_pos = wt_ctx->pos + ret;
@@ -1122,12 +1125,27 @@ static ssize_t iomap_writethrough_complete(struct iomap_writethrough_ctx *wt_ctx
return ret;
}
+static void iomap_writethrough_complete_work(struct work_struct *work)
+{
+ struct iomap_writethrough_ctx *wt_ctx =
+ container_of(work, struct iomap_writethrough_ctx, aio_work);
+ struct kiocb *iocb = wt_ctx->iocb;
+
+ iocb->ki_complete(iocb, iomap_writethrough_complete(wt_ctx));
+}
+
static void iomap_writethrough_done(struct iomap_writethrough_ctx *wt_ctx)
{
- struct task_struct *waiter = wt_ctx->waiter;
+ if (!wt_ctx->is_aio) {
+ struct task_struct *waiter = wt_ctx->waiter;
- WRITE_ONCE(wt_ctx->waiter, NULL);
- blk_wake_io_task(waiter);
+ WRITE_ONCE(wt_ctx->waiter, NULL);
+ blk_wake_io_task(waiter);
+ return;
+ }
+
+ INIT_WORK(&wt_ctx->aio_work, iomap_writethrough_complete_work);
+ queue_work(wt_ctx->inode->i_sb->s_dio_done_wq, &wt_ctx->aio_work);
return;
}
@@ -1530,9 +1548,6 @@ ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
if (iocb_is_dsync(iocb))
/* D_SYNC support not implemented yet */
return -EOPNOTSUPP;
- if (!is_sync_kiocb(iocb))
- /* aio support not implemented yet */
- return -EOPNOTSUPP;
/*
* +1 to max bvecs to account for unaligned write spanning multiple
@@ -1557,11 +1572,32 @@ ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
wt_ctx->pos = iocb->ki_pos;
wt_ctx->new_i_size = i_size_read(inode);
wt_ctx->max_bvecs = max_bvecs;
+ wt_ctx->is_aio = !is_sync_kiocb(iocb);
atomic_set(&wt_ctx->ref, 1);
- wt_ctx->waiter = current;
+
+ if (!wt_ctx->is_aio)
+ wt_ctx->waiter = current;
+ else
+ /*
+ * With aio, writethrough can be in progress even after dropping
+ * inode and folio lock. Due to this, we need a way to
+ * synchronise with other paths where stable write is not enough
+ * (example truncate). Hence use the dio begin/end as it gives
+ * us the required guarantees.
+ */
+ inode_dio_begin(inode);
mapping_set_stable_writes(inode->i_mapping);
+ if (wt_ctx->is_aio && !inode->i_sb->s_dio_done_wq) {
+ ret = sb_init_dio_done_wq(inode->i_sb);
+ if (ret < 0) {
+ mapping_clear_stable_writes(inode->i_mapping);
+ kfree(wt_ctx);
+ return ret;
+ }
+ }
+
while ((ret = iomap_iter(&iter, wt_ops->ops)) > 0) {
WARN_ON(iter.iomap.type != IOMAP_UNWRITTEN &&
iter.iomap.type != IOMAP_MAPPED);
@@ -1571,6 +1607,9 @@ ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
cmpxchg(&wt_ctx->error, 0, ret);
if (!atomic_dec_and_test(&wt_ctx->ref)) {
+ if (wt_ctx->is_aio)
+ return -EIOCBQUEUED;
+
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (!READ_ONCE(wt_ctx->waiter))
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 661233aa009d..e99f7c279dc6 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -486,9 +486,15 @@ struct iomap_writethrough_ctx {
atomic_t ref;
unsigned int flags;
int error;
+ bool is_aio;
- /* used during submission and for non-aio completion */
- struct task_struct *waiter;
+ union {
+ /* used during submission and for non-aio completion */
+ struct task_struct *waiter;
+
+ /* used during aio completion */
+ struct work_struct aio_work;
+ };
loff_t bio_pos;
unsigned int nr_bvecs;
--
2.53.0
* [RFC PATCH v2 5/5] iomap: Add DSYNC support to writethrough
2026-04-08 18:45 [RFC PATCH v2 0/5] Add buffered write-through support to iomap & xfs Ojaswin Mujoo
` (3 preceding siblings ...)
2026-04-08 18:45 ` [RFC PATCH v2 4/5] iomap: Add aio support to RWF_WRITETHROUGH Ojaswin Mujoo
@ 2026-04-08 18:45 ` Ojaswin Mujoo
4 siblings, 0 replies; 6+ messages in thread
From: Ojaswin Mujoo @ 2026-04-08 18:45 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: djwong, john.g.garry, willy, hch, ritesh.list, jack,
Luis Chamberlain, dgc, tytso, p.raghav, andres, brauner,
linux-kernel, linux-mm
Add DSYNC support to writethrough buffered writes. Unlike the usual
buffered writes, where we call generic_write_sync() inline in the
syscall path, for writethrough we instead sync the data in the IO
completion path, just like dio.
This allows aio writethrough to be truly async: the syscall can return
right after IO submission and the sync is then performed asynchronously
at IO completion time.
Further, just like dio, we utilize the FUA optimization, where
available, to avoid an explicit cache flush for DSYNC writes.
Suggested-by: Dave Chinner <dgc@kernel.org>
Co-developed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
fs/iomap/buffered-io.c | 37 +++++++++++++++++++++++++++++++++----
include/linux/iomap.h | 1 +
2 files changed, 34 insertions(+), 4 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6937f10e2782..8965f603f2cf 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1119,6 +1119,14 @@ static ssize_t iomap_writethrough_complete(struct iomap_writethrough_ctx *wt_ctx
if (!ret) {
ret = wt_ctx->written;
iocb->ki_pos = wt_ctx->pos + ret;
+
+ /*
+ * If this is a DSYNC write and we couldn't optimize it, make
+ * sure we push it to stable storage now that we've written
+ * data.
+ */
+ if (iocb_is_dsync(wt_ctx->iocb) && !wt_ctx->use_fua)
+ ret = generic_write_sync(iocb, ret);
}
kfree(wt_ctx);
@@ -1173,6 +1181,7 @@ iomap_writethrough_submit_bio(struct iomap_writethrough_ctx *wt_ctx,
struct bio *bio;
unsigned int i;
u64 len = 0;
+ blk_opf_t opf = REQ_OP_WRITE;
if (!wt_ctx->nr_bvecs)
return;
@@ -1184,7 +1193,10 @@ iomap_writethrough_submit_bio(struct iomap_writethrough_ctx *wt_ctx,
wt_ops->writethrough_submit(wt_ctx->inode, iomap, wt_ctx->bio_pos,
len);
- bio = bio_alloc(iomap->bdev, wt_ctx->nr_bvecs, REQ_OP_WRITE, GFP_NOFS);
+ if (wt_ctx->use_fua)
+ opf |= REQ_FUA;
+
+ bio = bio_alloc(iomap->bdev, wt_ctx->nr_bvecs, opf, GFP_NOFS);
bio->bi_iter.bi_sector = iomap_sector(iomap, wt_ctx->bio_pos);
bio->bi_end_io = iomap_writethrough_bio_end_io;
bio->bi_private = wt_ctx;
@@ -1273,6 +1285,19 @@ static int iomap_writethrough_iter(struct iomap_writethrough_ctx *wt_ctx,
if (!(iter->flags & IOMAP_WRITETHROUGH))
return -EINVAL;
+ /*
+ * If we realise that a cache flush is necessary (e.g. FUA is not
+ * present or we need metadata updates) then we turn off the optimization.
+ */
+ if (wt_ctx->use_fua) {
+ if (iter->iomap.type != IOMAP_MAPPED ||
+ (iter->iomap.flags &
+ (IOMAP_F_NEW | IOMAP_F_SHARED | IOMAP_F_DIRTY)) ||
+ (bdev_write_cache(iter->iomap.bdev) &&
+ !bdev_fua(iter->iomap.bdev)))
+ wt_ctx->use_fua = false;
+ }
+
do {
struct folio *folio;
size_t offset; /* Offset into folio */
@@ -1545,9 +1570,6 @@ ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
return -EINVAL;
if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_DONTCACHE))
return -EINVAL;
- if (iocb_is_dsync(iocb))
- /* D_SYNC support not implemented yet */
- return -EOPNOTSUPP;
/*
* +1 to max bvecs to account for unaligned write spanning multiple
@@ -1575,6 +1597,13 @@ ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
wt_ctx->is_aio = !is_sync_kiocb(iocb);
atomic_set(&wt_ctx->ref, 1);
+ /*
+ * Similar to dio, we optimistically set use_fua=true to avoid explicit
+ * sync. In case we later realise cache flush is needed we set it back
+ * to false.
+ */
+ wt_ctx->use_fua = iocb_is_dsync(iocb) && !(iocb->ki_flags & IOCB_SYNC);
+
if (!wt_ctx->is_aio)
wt_ctx->waiter = current;
else
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index e99f7c279dc6..579bc48ed39c 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -487,6 +487,7 @@ struct iomap_writethrough_ctx {
unsigned int flags;
int error;
bool is_aio;
+ bool use_fua;
union {
/* used during submission and for non-aio completion */
--
2.53.0