From: Brian Foster <bfoster@redhat.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 2/6] iomap, xfs: lift zero range hole mapping flush into xfs
Date: Thu, 6 Nov 2025 10:52:34 -0500 [thread overview]
Message-ID: <aQzEQtynNsJLdLcD@bfoster> (raw)
In-Reply-To: <20251105222350.GO196362@frogsfrogsfrogs>
On Wed, Nov 05, 2025 at 02:23:50PM -0800, Darrick J. Wong wrote:
> On Wed, Nov 05, 2025 at 10:33:16AM -0500, Brian Foster wrote:
> > On Tue, Nov 04, 2025 at 04:31:14PM -0800, Darrick J. Wong wrote:
> > > On Thu, Oct 16, 2025 at 03:02:59PM -0400, Brian Foster wrote:
> > > > iomap zero range has a wart in that it also flushes dirty pagecache
> > > > over hole mappings (rather than only unwritten mappings). This was
> > > > included to accommodate a quirk in XFS where COW fork preallocation
> > > > can exist over a hole in the data fork, and the associated range is
> > > > reported as a hole. This is because the range actually is a hole,
> > > > but XFS also has an optimization where if COW fork blocks exist for
> > > > a range being written to, those blocks are used regardless of
> > > > whether the data fork blocks are shared or not. For zeroing, COW
> > > > fork blocks over a data fork hole are only relevant if the range is
> > > > dirty in pagecache, otherwise the range is already considered
> > > > zeroed.
> > >
> > > It occurs to me that the situation (unwritten cow mapping, hole in data
> > > fork) results in iomap_iter::iomap getting the unwritten mapping, and
> > > iomap_iter::srcmap getting the hole mapping. iomap_iter_srcmap returns
> > > iomap_itere::iomap because srcmap.type == HOLE.
> > >
> > > But then you have ext4 where there is no cow fork, so it will only ever
> > > set iomap_iter::iomap, leaving iomap_iter::srcmap set to the default.
> > > The default srcmap is a HOLE.
> > >
> > > So iomap can't distinguish between xfs' speculative cow over a hole
> > > behavior vs. ext4 just being simple. I wonder if we actually need to
> > > introduce a new iomap type for "pure overwrite"?
> > >
> >
> > I definitely think we need a better solution here in iomap. The current
> > iomap/srcmap management/handling is quite confusing. What that solution
> > is, I'm not sure.
> >
> > > The reason I say that that in designing the fuse-iomap uapi, it was a
> > > lot easier to understand the programming model if there was always
> > > explicit read and write mappings being sent back and forth; and a new
> > > type FUSE_IOMAP_TYPE_PURE_OVERWRITE that could be stored in the write
> > > mapping to mean "just look at the read mapping". If such a beast were
> > > ported to the core iomap code then maybe that would help here?
> > >
> >
> > I'm not following what this means. Separate read/write mappings for each
> > individual iomap operation (i.e. "read from here, write to there"), or
> > separate iomap structures to be used for read ops vs. write ops, or
> > something else..?
>
> "read from here, write to there".
>
Ok..
> First, we move IOMAP_HOLE up one:
>
> #define IOMAP_NULL 0 /* no mapping here at all */
> #define IOMAP_HOLE 1 /* no blocks allocated, need allocation */
> #define IOMAP_DELALLOC 2 /* delayed allocation blocks */
> ...
>
> and do some renaming:
>
> struct iomap_iter {
> struct inode *inode;
> loff_t pos;
> u64 len;
> loff_t iter_start_pos;
> int status;
> unsigned flags;
> struct iomap write_map;
> struct iomap read_map;
> void *private;
> };
>
> Then we change the interface so that ->iomap_begin always sets read_map
> to a mapping from which file data can be read, and write_map is always
> set to a mapping into which file data can be written. If a filesystem
> doesn't support out of place writes, then it can ignore write_map and
> write_map.type will be IOMAP_NULL.
>
> (Obviously the fs always has to supply a read mapping)
>
> The read operations (e.g. readahead, fiemap) only ever pay attention to
> what the filesystem supplies in iomap_iter::read_map.
>
> An unaligned pagecache write to an uncached region uses the read mapping
> to pull data into the pagecache. For writeback, we'd use the write
> mapping if it's non-null, or else the read mapping.
>
Or perhaps let the read/write mappings overlap? It's not clear to me if
that's better or worse. ;P
> This might not move the needle much wrt to fixing your problem, but at
> least it eliminates the weirdness around "@iomap is for reads except
> when you're doing a write but you have to do a read *and* @srcmap isn't
> a hole".
>
Yeah.. it might be reaching a pedantic level, but to me having a couple
mappings that say "you can read from this range, write to that range,
and they might be the same" is more clear than the srcmap/dstmap/maybe
both logic we have today.
I'd be a little concerned about having similar complexity along the
lines of "read from here, write to there, except maybe sometimes write
to where you read from" in iomap if we could just make the logic dead
simple/consistent and give the fs helpers to fill things out properly.
But again this is my first exposure to the idea and so I'm thinking out
loud and probably missing some details..
> > > A hole with an out-of-place mapping needs a flush (or maybe just go find
> > > the pagecache and zero it), whereas a hole with nothing else backing it
> > > clearly doesn't need any action at all.
> > >
> > > Does that help?
> > >
> >
> > This kind of sounds like what we're already doing in iomap, so I suspect
> > I'm missing something on the fuse side...
> >
> > WRT this patchset, I'm trying to address the underlying problems that
> > require the flush-a-dirty-hole hack that provides zeroing correctness
> > for XFS. This needs to be lifted out of iomap because it also causes
> > problems for ext4, but can be bypassed completely for XFS as well by the
> > end of the series. The first part is just a straight lift into xfs, but
> > the next patches replace the flush with use of the folio batch, and
> > split off the band-aid case down into insert range where this flush was
> > also indirectly suppressing issues.
> >
> > For the former, if you look at the last few patches the main reason we
> > rely on the flush-a-dirty-hole hack is that we don't actually report the
> > mappings correctly in XFS for zero range with respect to allowable
> > behavior. We just report a hole if one exists in the data fork. So this
> > is trying to encode that "COW fork prealloc over data fork hole"
> > scenario correctly for zero range, and also identify when we need to
> > consider whether the mapping range is dirty in cache (i.e. unwritten COW
> > blocks).
>
> Ahh, ok. I think I get what you're saying now -- instead of doddering
> around in iomap to make it know the difference between "prealloc in cow,
> hole in data fork" vs. "hole in data fork", you'd rather try to
> accumulate folios in the folio_batch, hand that to iomap, and then iomap
> will just zero folios in the batch. No need for the preflush or deep
> surgery to core code.
>
Yeah, I don't want to preclude potential mapping improvements, but I'm
also not trying to solve that here just for zero range.
I suppose we could consider changing this particular case to also report
the cow mapping (similar to buffered write IIRC) and maybe that would
reduce some quirkiness, but I'd have to think harder and test that out.
It might probably still be a separate patch just because I'm hesitant to
change too much at once and hurt bisectability.
> > So yes in general I think we need to improve on iomap reporting somehow,
> > but I don't necessarily see how that avoids the need (or desire) to fix
> > up the iomap_begin logic. I also think it's confusing enough that it
> > should probably be a separate discussion (I'd probably need to stare at
> > the fuse-related proposition to grok it).
> >
> > Ultimately the flush in zero range should go away completely except for
> > the default/fallback case where the fs supports zero range, fails to
> > check pagecache itself, and iomap has otherwise detected that the range
> > over an unwritten mapping was dirty. There has been some discussion over
> > potentially lifting the batch lookup into iomap as well, but there are
> > some details that would need to be worked out to determine whether that
> > can be done safely.
>
> <nod> ok I'll keep reading this series.
>
Thanks.
Brian
> --D
>
> > Brian
> >
> > > --D
> > >
> > > > The easiest way to deal with this corner case is to flush the
> > > > pagecache to trigger COW remapping into the data fork, and then
> > > > operate on the updated on-disk state. The problem is that ext4
> > > > cannot accommodate a flush from this context due to being a
> > > > transaction deadlock vector.
> > > >
> > > > Outside of the hole quirk, ext4 can avoid the flush for zero range
> > > > by using the recently introduced folio batch lookup mechanism for
> > > > unwritten mappings. Therefore, take the next logical step and lift
> > > > the hole handling logic into the XFS iomap_begin handler. iomap will
> > > > still flush on unwritten mappings without a folio batch, and XFS
> > > > will flush and retry mapping lookups in the case where it would
> > > > otherwise report a hole with dirty pagecache during a zero range.
> > > >
> > > > Note that this is intended to be a fairly straightforward lift and
> > > > otherwise not change behavior. Now that the flush exists within XFS,
> > > > follow on patches can further optimize it.
> > > >
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > > fs/iomap/buffered-io.c | 2 +-
> > > > fs/xfs/xfs_iomap.c | 25 ++++++++++++++++++++++---
> > > > 2 files changed, 23 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index 05ff82c5432e..d6de689374c3 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -1543,7 +1543,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > > > srcmap->type == IOMAP_UNWRITTEN)) {
> > > > s64 status;
> > > >
> > > > - if (range_dirty) {
> > > > + if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) {
> > > > range_dirty = false;
> > > > status = iomap_zero_iter_flush_and_stale(&iter);
> > > > } else {
> > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > > index 01833aca37ac..b84c94558cc9 100644
> > > > --- a/fs/xfs/xfs_iomap.c
> > > > +++ b/fs/xfs/xfs_iomap.c
> > > > @@ -1734,6 +1734,7 @@ xfs_buffered_write_iomap_begin(
> > > > if (error)
> > > > return error;
> > > >
> > > > +restart:
> > > > error = xfs_ilock_for_iomap(ip, flags, &lockmode);
> > > > if (error)
> > > > return error;
> > > > @@ -1761,9 +1762,27 @@ xfs_buffered_write_iomap_begin(
> > > > if (eof)
> > > > imap.br_startoff = end_fsb; /* fake hole until the end */
> > > >
> > > > - /* We never need to allocate blocks for zeroing or unsharing a hole. */
> > > > - if ((flags & (IOMAP_UNSHARE | IOMAP_ZERO)) &&
> > > > - imap.br_startoff > offset_fsb) {
> > > > + /* We never need to allocate blocks for unsharing a hole. */
> > > > + if ((flags & IOMAP_UNSHARE) && imap.br_startoff > offset_fsb) {
> > > > + xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
> > > > + goto out_unlock;
> > > > + }
> > > > +
> > > > + /*
> > > > + * We may need to zero over a hole in the data fork if it's fronted by
> > > > + * COW blocks and dirty pagecache. To make sure zeroing occurs, force
> > > > + * writeback to remap pending blocks and restart the lookup.
> > > > + */
> > > > + if ((flags & IOMAP_ZERO) && imap.br_startoff > offset_fsb) {
> > > > + if (filemap_range_needs_writeback(inode->i_mapping, offset,
> > > > + offset + count - 1)) {
> > > > + xfs_iunlock(ip, lockmode);
> > > > + error = filemap_write_and_wait_range(inode->i_mapping,
> > > > + offset, offset + count - 1);
> > > > + if (error)
> > > > + return error;
> > > > + goto restart;
> > > > + }
> > > > xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
> > > > goto out_unlock;
> > > > }
> > > > --
> > > > 2.51.0
> > > >
> > > >
> > >
> >
> >
>
next prev parent reply other threads:[~2025-11-06 15:48 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-16 19:02 [PATCH 0/6] iomap, xfs: improve zero range flushing and lookup Brian Foster
2025-10-16 19:02 ` [PATCH 1/6] iomap: replace folio_batch allocation with stack allocation Brian Foster
2025-11-05 0:07 ` Darrick J. Wong
2025-11-05 15:27 ` Brian Foster
2025-11-05 21:41 ` Darrick J. Wong
2025-11-06 15:51 ` Brian Foster
2025-11-06 15:58 ` Darrick J. Wong
2025-10-16 19:02 ` [PATCH 2/6] iomap, xfs: lift zero range hole mapping flush into xfs Brian Foster
2025-11-05 0:31 ` Darrick J. Wong
2025-11-05 15:33 ` Brian Foster
2025-11-05 22:23 ` Darrick J. Wong
2025-11-06 15:52 ` Brian Foster [this message]
2025-11-06 23:32 ` Darrick J. Wong
2025-11-07 13:52 ` Brian Foster
2025-11-07 13:59 ` Christoph Hellwig
2025-11-07 13:57 ` Christoph Hellwig
2025-11-07 13:55 ` Christoph Hellwig
2025-10-16 19:03 ` [PATCH 3/6] xfs: flush eof folio before insert range size update Brian Foster
2025-11-05 0:14 ` Darrick J. Wong
2025-11-05 15:34 ` Brian Foster
2025-11-05 22:15 ` Darrick J. Wong
2025-10-16 19:03 ` [PATCH 4/6] xfs: look up cow fork extent earlier for buffered iomap_begin Brian Foster
2025-11-05 22:26 ` Darrick J. Wong
2025-10-16 19:03 ` [PATCH 5/6] xfs: only flush when COW fork blocks overlap data fork holes Brian Foster
2025-10-16 19:03 ` [PATCH 6/6] xfs: replace zero range flush with folio batch Brian Foster
2025-11-05 22:37 ` Darrick J. Wong
2025-11-06 15:53 ` Brian Foster
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aQzEQtynNsJLdLcD@bfoster \
--to=bfoster@redhat.com \
--cc=djwong@kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).