From: Dave Chinner <dgc@kernel.org>
To: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
djwong@kernel.org, john.g.garry@oracle.com, willy@infradead.org,
hch@lst.de, ritesh.list@gmail.com, jack@suse.cz,
Luis Chamberlain <mcgrof@kernel.org>,
tytso@mit.edu, p.raghav@samsung.com, andres@anarazel.de,
linux-kernel@vger.kernel.org
Subject: Re: [RFC 1/3] iomap: Support buffered RWF_WRITETHROUGH via async dio backend
Date: Wed, 11 Mar 2026 23:05:05 +1100
Message-ID: <abFacQlKfQj3aQZk@dread>
In-Reply-To: <abFFcSTmUaAzbmOB@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>

On Wed, Mar 11, 2026 at 04:05:29PM +0530, Ojaswin Mujoo wrote:
> On Tue, Mar 10, 2026 at 05:48:12PM +1100, Dave Chinner wrote:
> > On Mon, Mar 09, 2026 at 11:04:31PM +0530, Ojaswin Mujoo wrote:
> > This is not what I envisaged write-through using DIO to look like.
> > This is a DIO per folio, rather than a DIO per write() syscall. We
> > want the latter to be the common case, not the former, especially
> > when it comes to RWF_ATOMIC support.
> >
> > i.e. I was expecting something more like having a wt context
> > allocated up front with an appropriately sized bvec appended to it
> > (i.e. single allocation for the common case). Then in
> > iomap_write_end(), we'd mark the folio as under writeback and add it
> > to the bvec. Then we iterate through the IO range adding folio after
> > folio to the bvec.
> >
> > When the bvec is full or we reach the end of the IO, we then push
> > that bvec down to the DIO code. Ideally we'd also push the iomap we
> > already hold down as well, so that the DIO code does not need to
> > look it up again (unless the mapping is stale). The DIO completion
> > callback then runs a completion callback that iterates the folios
> > attached to the bvec and runs buffered writeback completion on them.
> > It can then decrement the wt-ctx IO-in-flight counter.
> >
> > If there is more user data to submit, we keep going around (with a
> > new bvec if we need it) adding folios and submitting them to the dio
> > code until there is no more data to copy in and submit.
> >
> > The writethrough context then drops its own "in-flight" reference
> > and waits for the in-flight counter to go to zero.
>
> Hi Dave,
>
> Thanks for the review. IIUC you are suggesting a per-iomap submission
> of the dio rather than a per-folio one,

Yes, this is the original architectural premise of iomap: we map the
extent first, then iterate over folios, then submit a single bio for
the extent...
> and for each iomap we submit we can
> maintain a per writethrough counter which helps us perform any sort of
> endio cleanup work. I can give this design a try in v2.

Yes, this is exactly how iomap DIO completion tracking works for
IO that requires multiple bios to be submitted. i.e. completion
processing only runs once all IOs -and submission- have completed.
> > > index c24d94349ca5..f4d8ff08a83a 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -713,7 +713,8 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > dio->i_size = i_size_read(inode);
> > > dio->dops = dops;
> > > dio->error = 0;
> > > - dio->flags = dio_flags & (IOMAP_DIO_FSBLOCK_ALIGNED | IOMAP_DIO_BOUNCE);
> > > + dio->flags = dio_flags & (IOMAP_DIO_FSBLOCK_ALIGNED | IOMAP_DIO_BOUNCE |
> > > + IOMAP_DIO_BUF_WRITETHROUGH);
> > > dio->done_before = done_before;
> > >
> > > dio->submit.iter = iter;
> > > @@ -747,8 +748,13 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > if (iocb->ki_flags & IOCB_ATOMIC)
> > > iomi.flags |= IOMAP_ATOMIC;
> > >
> > > - /* for data sync or sync, we need sync completion processing */
> > > - if (iocb_is_dsync(iocb)) {
> > > + /*
> > > + * for data sync or sync, we need sync completion processing.
> > > + * for buffered writethrough, sync is handled in buffered IO
> > > + * path so not needed here
> > > + */
> > > + if (iocb_is_dsync(iocb) &&
> > > + !(dio->flags & IOMAP_DIO_BUF_WRITETHROUGH)) {
> > > dio->flags |= IOMAP_DIO_NEED_SYNC;
> >
> > Ah, that looks wrong. We want writethrough to be able to use FUA
> > optimisations for RWF_DSYNC. This prevents the DIO write for wt from
> > setting IOMAP_DIO_WRITE_THROUGH which is needed to trigger FUA
> > writes for RWF_DSYNC ops.
> >
> > i.e. we need DIO to handle the write completions directly to allow
> > conditional calling of generic_write_sync() based on whether FUA
> > writes were used or not.
>
> Yes right, for now we just let xfs_file_buffered_write() ->
> generic_write_sync() handle the sync, because first we wanted to have
> some discussion on how to correctly implement optimized
> O_DSYNC/RWF_DSYNC.

Ah, I had assumed that discussion was largely unnecessary because it
was obvious to me how to implement writethrough behaviour. i.e. all
we need to do is replicate the iomap DIO internal
submission/completion model for wt around the outside of the async
DIO write submission loop, and we are largely done.
> Some open questions that I have right now:
>
> 1. For non-aio non FUA writethrough, where is the right place to do the
> sync?

At the end of final wt ctx IO completion, just like we do for DIO.
> We can't simply rely on iomap_dio_complete() to do the sync
> since we still hold writeback bit and that causes a deadlock.

Right, you do it at wt ctx IO completion after all the folios in the
range have been marked clean. At that point, all that remains is for
the device cache to be flushed and the metadata sync operations to
be performed.

i.e. This is exactly the same integrity requirement as non-FUA
RWF_DSYNC DIO.
> Even if
> we solve that, we need to have a way to propagate any fsync errors
> back to the user, so endio might not be the right place anyway?

It's the same model as DIO. If we have async submission and the IO
is not complete, we return -EIOCBQUEUED. Otherwise we gather the
error from the completed wt ctx and return that.
> 2. For non-aio writethrough, if we do want to do the sync via
> xfs_file_buffered_write() -> generic_write_sync(),

We don't want to do that. This crappy "caller submits and waits for
IO" model is the primary reason we can't do async buffered
RWF_DSYNC.

The sync operation needs to be run at completion processing. If we
are not doing AIO, then the submitter waits for all submitted IOs to
complete, then runs completion processing itself. IO errors are
collected directly and returned to the submitter.

If we are doing AIO, then the submitter drops its IO reference, and
then the final IO completion that runs will execute the sync
operation, and the result is returned to the AIO completion ring
via the iocb->ki_complete() callback.

This is exactly the same model as the iomap DIO code - it's lifted
up a layer to the buffered WT layer and wrapped around multiple
async DIO submissions...
> we need a way to
> propagate IOMAP_DIO_WRITE_THROUGH to the higher level so that we can
> skip the sync.

The sync disappears completely from the higher layers - to take
advantage of FUA optimisations, the sync operations need to be
handled by the WT code. i.e. Buffered DSYNC or OSYNC writes are
-always- write-through operations after this infrastructure is put
in place; they are never run by high level code like
xfs_file_buffered_write().

Indeed, do you see generic_write_sync() calls in the XFS DIO write
paths?

> Also, a naive question, usually DSYNC means that by the time the
> syscall returns we'd either know data has reached the medium or we will
> get an error. Even in aio context I think we respect this semantic
> currently.

For a write() style syscall, yes. For AIO/io_uring, no.
io_submit() only returns an error if there is something wrong
with the aio ctx or iocbs being submitted. It does not report
completion status of the iocbs that are submitted. You need to call
io_getevents() to obtain the completion status of individual iocbs
that have been submitted via io_submit().

Think about it: if you submit 16 IOs in one io_submit() call and
one fails, how do you find out which one failed?
> However, with our idea of making the DSYNC buffered aio also
> truly async, via writethrough, won't we be violating this guarantee?

No, the error will be returned to the AIO completion ring, same as
it is now.

-Dave.
--
Dave Chinner
dgc@kernel.org
Thread overview: 11+ messages
2026-03-09 17:34 [RFC 0/3] Add buffered write-through support to iomap & xfs Ojaswin Mujoo
2026-03-09 17:34 ` [RFC 1/3] iomap: Support buffered RWF_WRITETHROUGH via async dio backend Ojaswin Mujoo
2026-03-10 6:48 ` Dave Chinner
2026-03-11 10:35 ` Ojaswin Mujoo
2026-03-11 12:05 ` Dave Chinner [this message]
2026-03-13 7:43 ` Ojaswin Mujoo
2026-03-09 17:34 ` [RFC 2/3] iomap: Enable stable writes for RWF_WRITETHROUGH inodes Ojaswin Mujoo
2026-03-10 3:57 ` Darrick J. Wong
2026-03-10 5:25 ` Ritesh Harjani
2026-03-11 6:27 ` Ojaswin Mujoo
2026-03-09 17:34 ` [RFC 3/3] xfs: Add RWF_WRITETHROUGH support to xfs Ojaswin Mujoo