* [PATCH 0/6] iomap, xfs: improve zero range flushing and lookup
@ 2025-10-16 19:02 Brian Foster
2025-10-16 19:02 ` [PATCH 1/6] iomap: replace folio_batch allocation with stack allocation Brian Foster
` (5 more replies)
0 siblings, 6 replies; 27+ messages in thread
From: Brian Foster @ 2025-10-16 19:02 UTC (permalink / raw)
To: linux-fsdevel, linux-xfs
Hi all,
Now that the folio_batch bits are headed into -next, here is the next
phase of cleanups. Most of this has been previously discussed at one
point or another. Zhang Yi had reported a couple of outstanding issues
related to the conversion of ext4 over to iomap. One had to do with the
context of the dynamic folio_batch allocation, and the other is that the
flush in the non-unwritten mapping case can cause problems.
This series attempts to address the former issue in patch 1 by using a
stack-allocated folio_batch and an iomap flag, eliminating the need for
the dynamic allocation. The non-unwritten flush case only exists as a
band-aid for wonky XFS behavior, so patches 2-6 lift this logic into XFS
and work on it from there. Ultimately, the flush is relocated to insert
range, where it appears to be needed, and the iomap_begin usage is
replaced with another use of the folio batch mechanism.
This has survived testing so far on XFS in a handful of different
configs and arches. WRT patch 3, I would have liked to reorder the
existing insert range truncate and flush in either direction rather than
introduce a new flush just for EOF, but neither seemed obviously clean
enough to me as I was looking at it with the current code factoring. So
rather than go back and forth on that on my own I opted to keep the
patch simple to start and maybe see what the folks on the XFS list
think.
Note that this applies on top of the pending folio_batch series [1].
Thoughts, reviews, flames appreciated.
Brian
[1] https://lore.kernel.org/linux-fsdevel/20251003134642.604736-1-bfoster@redhat.com/
Brian Foster (6):
iomap: replace folio_batch allocation with stack allocation
iomap, xfs: lift zero range hole mapping flush into xfs
xfs: flush eof folio before insert range size update
xfs: look up cow fork extent earlier for buffered iomap_begin
xfs: only flush when COW fork blocks overlap data fork holes
xfs: replace zero range flush with folio batch
fs/iomap/buffered-io.c | 49 +++++++++++++--------
fs/iomap/iter.c | 6 +--
fs/xfs/xfs_file.c | 17 ++++++++
fs/xfs/xfs_iomap.c | 98 +++++++++++++++++++++++++++++-------------
include/linux/iomap.h | 8 +++-
5 files changed, 125 insertions(+), 53 deletions(-)
--
2.51.0
* [PATCH 1/6] iomap: replace folio_batch allocation with stack allocation
2025-10-16 19:02 [PATCH 0/6] iomap, xfs: improve zero range flushing and lookup Brian Foster
@ 2025-10-16 19:02 ` Brian Foster
2025-11-05 0:07 ` Darrick J. Wong
2025-10-16 19:02 ` [PATCH 2/6] iomap, xfs: lift zero range hole mapping flush into xfs Brian Foster
` (4 subsequent siblings)
5 siblings, 1 reply; 27+ messages in thread
From: Brian Foster @ 2025-10-16 19:02 UTC (permalink / raw)
To: linux-fsdevel, linux-xfs
Zhang Yi points out that the dynamic folio_batch allocation in
iomap_fill_dirty_folios() is problematic for the ext4 on iomap work
that is under development because it doesn't sufficiently handle the
allocation failure case (by allowing a retry, for example).
The dynamic allocation was initially added for simplicity and to
help indicate whether the batch was used or not by the calling fs.
To address this issue, put the batch on the stack of
iomap_zero_range() and use a flag to control whether the batch
should be used in the iomap folio lookup path. This keeps things
simple and eliminates the concern for ext4 on iomap.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/iomap/buffered-io.c | 45 ++++++++++++++++++++++++++++--------------
fs/iomap/iter.c | 6 +++---
fs/xfs/xfs_iomap.c | 11 ++++++-----
include/linux/iomap.h | 8 ++++++--
4 files changed, 45 insertions(+), 25 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 51ecb6d48feb..05ff82c5432e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -761,7 +761,7 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter,
if (!mapping_large_folio_support(iter->inode->i_mapping))
len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
- if (iter->fbatch) {
+ if (iter->iomap.flags & IOMAP_F_FOLIO_BATCH) {
struct folio *folio = folio_batch_next(iter->fbatch);
if (!folio)
@@ -858,7 +858,7 @@ static int iomap_write_begin(struct iomap_iter *iter,
* process so return and let the caller iterate and refill the batch.
*/
if (!folio) {
- WARN_ON_ONCE(!iter->fbatch);
+ WARN_ON_ONCE(!(iter->iomap.flags & IOMAP_F_FOLIO_BATCH));
return 0;
}
@@ -1473,23 +1473,34 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
return status;
}
-loff_t
+/**
+ * iomap_fill_dirty_folios - fill a folio batch with dirty folios
+ * @iter: Iteration structure
+ * @start: Start offset of range. Updated based on lookup progress.
+ * @end: End offset of range
+ *
+ * Returns the associated control flag if the folio batch is available and the
+ * lookup performed. The caller is responsible to set the flag on the associated
+ * iomap.
+ */
+unsigned int
iomap_fill_dirty_folios(
struct iomap_iter *iter,
- loff_t offset,
- loff_t length)
+ loff_t *start,
+ loff_t end)
{
struct address_space *mapping = iter->inode->i_mapping;
- pgoff_t start = offset >> PAGE_SHIFT;
- pgoff_t end = (offset + length - 1) >> PAGE_SHIFT;
+ pgoff_t pstart = *start >> PAGE_SHIFT;
+ pgoff_t pend = (end - 1) >> PAGE_SHIFT;
- iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
- if (!iter->fbatch)
- return offset + length;
- folio_batch_init(iter->fbatch);
+ if (!iter->fbatch) {
+ *start = end;
+ return 0;
+ }
- filemap_get_folios_dirty(mapping, &start, end, iter->fbatch);
- return (start << PAGE_SHIFT);
+ filemap_get_folios_dirty(mapping, &pstart, pend, iter->fbatch);
+ *start = (pstart << PAGE_SHIFT);
+ return IOMAP_F_FOLIO_BATCH;
}
EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios);
@@ -1498,17 +1509,21 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
const struct iomap_ops *ops,
const struct iomap_write_ops *write_ops, void *private)
{
+ struct folio_batch fbatch;
struct iomap_iter iter = {
.inode = inode,
.pos = pos,
.len = len,
.flags = IOMAP_ZERO,
.private = private,
+ .fbatch = &fbatch,
};
struct address_space *mapping = inode->i_mapping;
int ret;
bool range_dirty;
+ folio_batch_init(&fbatch);
+
/*
* To avoid an unconditional flush, check pagecache state and only flush
* if dirty and the fs returns a mapping that might convert on
@@ -1519,11 +1534,11 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
while ((ret = iomap_iter(&iter, ops)) > 0) {
const struct iomap *srcmap = iomap_iter_srcmap(&iter);
- if (WARN_ON_ONCE(iter.fbatch &&
+ if (WARN_ON_ONCE((iter.iomap.flags & IOMAP_F_FOLIO_BATCH) &&
srcmap->type != IOMAP_UNWRITTEN))
return -EIO;
- if (!iter.fbatch &&
+ if (!(iter.iomap.flags & IOMAP_F_FOLIO_BATCH) &&
(srcmap->type == IOMAP_HOLE ||
srcmap->type == IOMAP_UNWRITTEN)) {
s64 status;
diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
index 66ca12aac57d..026d85823c76 100644
--- a/fs/iomap/iter.c
+++ b/fs/iomap/iter.c
@@ -8,10 +8,10 @@
static inline void iomap_iter_reset_iomap(struct iomap_iter *iter)
{
- if (iter->fbatch) {
+ if (iter->iomap.flags & IOMAP_F_FOLIO_BATCH) {
folio_batch_release(iter->fbatch);
- kfree(iter->fbatch);
- iter->fbatch = NULL;
+ folio_batch_reinit(iter->fbatch);
+ iter->iomap.flags &= ~IOMAP_F_FOLIO_BATCH;
}
iter->status = 0;
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 535bf3b8705d..01833aca37ac 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1775,7 +1775,6 @@ xfs_buffered_write_iomap_begin(
*/
if (flags & IOMAP_ZERO) {
xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
- u64 end;
if (isnullstartblock(imap.br_startblock) &&
offset_fsb >= eof_fsb)
@@ -1795,12 +1794,14 @@ xfs_buffered_write_iomap_begin(
*/
if (imap.br_state == XFS_EXT_UNWRITTEN &&
offset_fsb < eof_fsb) {
- loff_t len = min(count,
- XFS_FSB_TO_B(mp, imap.br_blockcount));
+ loff_t foffset = offset, fend;
- end = iomap_fill_dirty_folios(iter, offset, len);
+ fend = offset +
+ min(count, XFS_FSB_TO_B(mp, imap.br_blockcount));
+ iomap_flags |= iomap_fill_dirty_folios(iter, &foffset,
+ fend);
end_fsb = min_t(xfs_fileoff_t, end_fsb,
- XFS_B_TO_FSB(mp, end));
+ XFS_B_TO_FSB(mp, foffset));
}
xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index cd0f573156d6..79da917ff45e 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -87,6 +87,9 @@ struct vm_fault;
/*
* Flags set by the core iomap code during operations:
*
+ * IOMAP_F_FOLIO_BATCH indicates that the folio batch mechanism is active
+ * for this operation, set by iomap_fill_dirty_folios().
+ *
* IOMAP_F_SIZE_CHANGED indicates to the iomap_end method that the file size
* has changed as the result of this write operation.
*
@@ -94,6 +97,7 @@ struct vm_fault;
* range it covers needs to be remapped by the high level before the operation
* can proceed.
*/
+#define IOMAP_F_FOLIO_BATCH (1U << 13)
#define IOMAP_F_SIZE_CHANGED (1U << 14)
#define IOMAP_F_STALE (1U << 15)
@@ -351,8 +355,8 @@ bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
const struct iomap_ops *ops,
const struct iomap_write_ops *write_ops);
-loff_t iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t offset,
- loff_t length);
+unsigned int iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t *start,
+ loff_t end);
int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
bool *did_zero, const struct iomap_ops *ops,
const struct iomap_write_ops *write_ops, void *private);
--
2.51.0
* [PATCH 2/6] iomap, xfs: lift zero range hole mapping flush into xfs
2025-10-16 19:02 [PATCH 0/6] iomap, xfs: improve zero range flushing and lookup Brian Foster
2025-10-16 19:02 ` [PATCH 1/6] iomap: replace folio_batch allocation with stack allocation Brian Foster
@ 2025-10-16 19:02 ` Brian Foster
2025-11-05 0:31 ` Darrick J. Wong
2025-10-16 19:03 ` [PATCH 3/6] xfs: flush eof folio before insert range size update Brian Foster
` (3 subsequent siblings)
5 siblings, 1 reply; 27+ messages in thread
From: Brian Foster @ 2025-10-16 19:02 UTC (permalink / raw)
To: linux-fsdevel, linux-xfs
iomap zero range has a wart in that it also flushes dirty pagecache
over hole mappings (rather than only unwritten mappings). This was
included to accommodate a quirk in XFS where COW fork preallocation
can exist over a hole in the data fork, and the associated range is
reported as a hole. This is because the range actually is a hole,
but XFS also has an optimization where if COW fork blocks exist for
a range being written to, those blocks are used regardless of
whether the data fork blocks are shared or not. For zeroing, COW
fork blocks over a data fork hole are only relevant if the range is
dirty in pagecache, otherwise the range is already considered
zeroed.
The easiest way to deal with this corner case is to flush the
pagecache to trigger COW remapping into the data fork, and then
operate on the updated on-disk state. The problem is that ext4
cannot accommodate a flush from this context due to being a
transaction deadlock vector.
Outside of the hole quirk, ext4 can avoid the flush for zero range
by using the recently introduced folio batch lookup mechanism for
unwritten mappings. Therefore, take the next logical step and lift
the hole handling logic into the XFS iomap_begin handler. iomap will
still flush on unwritten mappings without a folio batch, and XFS
will flush and retry mapping lookups in the case where it would
otherwise report a hole with dirty pagecache during a zero range.
Note that this is intended to be a fairly straightforward lift and
otherwise not change behavior. Now that the flush exists within XFS,
follow on patches can further optimize it.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/iomap/buffered-io.c | 2 +-
fs/xfs/xfs_iomap.c | 25 ++++++++++++++++++++++---
2 files changed, 23 insertions(+), 4 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 05ff82c5432e..d6de689374c3 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1543,7 +1543,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
srcmap->type == IOMAP_UNWRITTEN)) {
s64 status;
- if (range_dirty) {
+ if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) {
range_dirty = false;
status = iomap_zero_iter_flush_and_stale(&iter);
} else {
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 01833aca37ac..b84c94558cc9 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1734,6 +1734,7 @@ xfs_buffered_write_iomap_begin(
if (error)
return error;
+restart:
error = xfs_ilock_for_iomap(ip, flags, &lockmode);
if (error)
return error;
@@ -1761,9 +1762,27 @@ xfs_buffered_write_iomap_begin(
if (eof)
imap.br_startoff = end_fsb; /* fake hole until the end */
- /* We never need to allocate blocks for zeroing or unsharing a hole. */
- if ((flags & (IOMAP_UNSHARE | IOMAP_ZERO)) &&
- imap.br_startoff > offset_fsb) {
+ /* We never need to allocate blocks for unsharing a hole. */
+ if ((flags & IOMAP_UNSHARE) && imap.br_startoff > offset_fsb) {
+ xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
+ goto out_unlock;
+ }
+
+ /*
+ * We may need to zero over a hole in the data fork if it's fronted by
+ * COW blocks and dirty pagecache. To make sure zeroing occurs, force
+ * writeback to remap pending blocks and restart the lookup.
+ */
+ if ((flags & IOMAP_ZERO) && imap.br_startoff > offset_fsb) {
+ if (filemap_range_needs_writeback(inode->i_mapping, offset,
+ offset + count - 1)) {
+ xfs_iunlock(ip, lockmode);
+ error = filemap_write_and_wait_range(inode->i_mapping,
+ offset, offset + count - 1);
+ if (error)
+ return error;
+ goto restart;
+ }
xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
goto out_unlock;
}
--
2.51.0
* [PATCH 3/6] xfs: flush eof folio before insert range size update
2025-10-16 19:02 [PATCH 0/6] iomap, xfs: improve zero range flushing and lookup Brian Foster
2025-10-16 19:02 ` [PATCH 1/6] iomap: replace folio_batch allocation with stack allocation Brian Foster
2025-10-16 19:02 ` [PATCH 2/6] iomap, xfs: lift zero range hole mapping flush into xfs Brian Foster
@ 2025-10-16 19:03 ` Brian Foster
2025-11-05 0:14 ` Darrick J. Wong
2025-10-16 19:03 ` [PATCH 4/6] xfs: look up cow fork extent earlier for buffered iomap_begin Brian Foster
` (2 subsequent siblings)
5 siblings, 1 reply; 27+ messages in thread
From: Brian Foster @ 2025-10-16 19:03 UTC (permalink / raw)
To: linux-fsdevel, linux-xfs
The flush in xfs_buffered_write_iomap_begin() for zero range over a
data fork hole fronted by COW fork prealloc is primarily designed to
provide correct zeroing behavior in particular pagecache conditions.
As it turns out, this also partially masks some odd behavior in
insert range (via zero range via setattr).
Insert range bumps i_size by the length of the new range, flushes,
unmaps pagecache and cancels COW prealloc, and then right shifts
extents from the end of the file back to the target offset of the
insert. Since the i_size update occurs before the pagecache flush,
this creates a transient situation where writeback around EOF can
behave differently.
This appears to be a corner case situation, but if the EOF range
happens to be fronted by COW fork speculative preallocation and a
large, dirty folio that contains at least one full COW block beyond
EOF, the writeback after i_size is bumped may remap that COW fork
block into the data fork within EOF. The block is zeroed and then
shifted back out to post-eof, but this is unexpected in that it leads
to a written post-eof data fork block. This can cause a zero range
warning on a subsequent size extension, because we should never find
blocks that require physical zeroing beyond i_size.
To avoid this quirk, flush the EOF folio before the i_size update
during insert range. The entire range will be flushed, unmapped and
invalidated anyway, so this should be relatively unnoticeable.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/xfs/xfs_file.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5b9864c8582e..cc3a9674ad40 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1226,6 +1226,23 @@ xfs_falloc_insert_range(
if (offset >= isize)
return -EINVAL;
+ /*
+ * Let writeback clean up EOF folio state before we bump i_size. The
+ * insert flushes before it starts shifting and under certain
+ * circumstances we can write back blocks that should technically be
+ * considered post-eof (and thus should not be submitted for writeback).
+ *
+ * For example, a large, dirty folio that spans EOF and is backed by
+ * post-eof COW fork preallocation can cause block remap into the data
+ * fork. This shifts back out beyond EOF, but creates an unexpectedly
+ * written post-eof block. The insert is going to flush, unmap and
+ * cancel prealloc across this whole range, so flush EOF now before we
+ * bump i_size to provide consistent behavior.
+ */
+ error = filemap_write_and_wait_range(inode->i_mapping, isize, isize);
+ if (error)
+ return error;
+
error = xfs_falloc_setsize(file, isize + len);
if (error)
return error;
--
2.51.0
* [PATCH 4/6] xfs: look up cow fork extent earlier for buffered iomap_begin
2025-10-16 19:02 [PATCH 0/6] iomap, xfs: improve zero range flushing and lookup Brian Foster
` (2 preceding siblings ...)
2025-10-16 19:03 ` [PATCH 3/6] xfs: flush eof folio before insert range size update Brian Foster
@ 2025-10-16 19:03 ` Brian Foster
2025-11-05 22:26 ` Darrick J. Wong
2025-10-16 19:03 ` [PATCH 5/6] xfs: only flush when COW fork blocks overlap data fork holes Brian Foster
2025-10-16 19:03 ` [PATCH 6/6] xfs: replace zero range flush with folio batch Brian Foster
5 siblings, 1 reply; 27+ messages in thread
From: Brian Foster @ 2025-10-16 19:03 UTC (permalink / raw)
To: linux-fsdevel, linux-xfs
To further isolate the need for flushing for zero range, we need to
know whether a hole in the data fork is fronted by blocks in the COW
fork or not. COW fork lookup currently occurs further down in the
function, after the zero range case is handled.
As a preparation step, lift the COW fork extent lookup to earlier in
the function, at the same time as the data fork lookup. Only the
lookup logic is lifted. The COW fork branch/reporting logic remains
as is to avoid any observable behavior change from an iomap
reporting perspective.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/xfs/xfs_iomap.c | 46 +++++++++++++++++++++++++---------------------
1 file changed, 25 insertions(+), 21 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index b84c94558cc9..ba5697d8b8fd 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1753,14 +1753,29 @@ xfs_buffered_write_iomap_begin(
goto out_unlock;
/*
- * Search the data fork first to look up our source mapping. We
- * always need the data fork map, as we have to return it to the
- * iomap code so that the higher level write code can read data in to
- * perform read-modify-write cycles for unaligned writes.
+ * Search the data fork first to look up our source mapping. We always
+ * need the data fork map, as we have to return it to the iomap code so
+ * that the higher level write code can read data in to perform
+ * read-modify-write cycles for unaligned writes.
+ *
+ * Then search the COW fork extent list even if we did not find a data
+ * fork extent. This serves two purposes: first this implements the
+ * speculative preallocation using cowextsize, so that we also unshare
+ * block adjacent to shared blocks instead of just the shared blocks
+ * themselves. Second the lookup in the extent list is generally faster
+ * than going out to the shared extent tree.
*/
eof = !xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &icur, &imap);
if (eof)
imap.br_startoff = end_fsb; /* fake hole until the end */
+ if (xfs_is_cow_inode(ip)) {
+ if (!ip->i_cowfp) {
+ ASSERT(!xfs_is_reflink_inode(ip));
+ xfs_ifork_init_cow(ip);
+ }
+ cow_eof = !xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb,
+ &ccur, &cmap);
+ }
/* We never need to allocate blocks for unsharing a hole. */
if ((flags & IOMAP_UNSHARE) && imap.br_startoff > offset_fsb) {
@@ -1827,24 +1842,13 @@ xfs_buffered_write_iomap_begin(
}
/*
- * Search the COW fork extent list even if we did not find a data fork
- * extent. This serves two purposes: first this implements the
- * speculative preallocation using cowextsize, so that we also unshare
- * block adjacent to shared blocks instead of just the shared blocks
- * themselves. Second the lookup in the extent list is generally faster
- * than going out to the shared extent tree.
+ * Now that we've handled any operation specific special cases, at this
+ * point we can report a COW mapping if found.
*/
- if (xfs_is_cow_inode(ip)) {
- if (!ip->i_cowfp) {
- ASSERT(!xfs_is_reflink_inode(ip));
- xfs_ifork_init_cow(ip);
- }
- cow_eof = !xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb,
- &ccur, &cmap);
- if (!cow_eof && cmap.br_startoff <= offset_fsb) {
- trace_xfs_reflink_cow_found(ip, &cmap);
- goto found_cow;
- }
+ if (xfs_is_cow_inode(ip) &&
+ !cow_eof && cmap.br_startoff <= offset_fsb) {
+ trace_xfs_reflink_cow_found(ip, &cmap);
+ goto found_cow;
}
if (imap.br_startoff <= offset_fsb) {
--
2.51.0
* [PATCH 5/6] xfs: only flush when COW fork blocks overlap data fork holes
2025-10-16 19:02 [PATCH 0/6] iomap, xfs: improve zero range flushing and lookup Brian Foster
` (3 preceding siblings ...)
2025-10-16 19:03 ` [PATCH 4/6] xfs: look up cow fork extent earlier for buffered iomap_begin Brian Foster
@ 2025-10-16 19:03 ` Brian Foster
2025-10-16 19:03 ` [PATCH 6/6] xfs: replace zero range flush with folio batch Brian Foster
5 siblings, 0 replies; 27+ messages in thread
From: Brian Foster @ 2025-10-16 19:03 UTC (permalink / raw)
To: linux-fsdevel, linux-xfs
The zero range hole mapping flush case has been lifted from iomap
into XFS. Now that we have more mapping context available from the
->iomap_begin() handler, we can isolate the flush further to when we
know a hole is fronted by COW blocks.
Rather than purely rely on pagecache dirty state, explicitly check
for the case where a range is a hole in both forks. Otherwise trim
to the range where there does happen to be overlap and use that for
the pagecache writeback check. This might prevent some spurious
zeroing, but more importantly makes it easier to remove the flush
entirely.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/xfs/xfs_iomap.c | 36 ++++++++++++++++++++++++++++++------
1 file changed, 30 insertions(+), 6 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index ba5697d8b8fd..29f1462819fa 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1704,10 +1704,12 @@ xfs_buffered_write_iomap_begin(
{
struct iomap_iter *iter = container_of(iomap, struct iomap_iter,
iomap);
+ struct address_space *mapping = inode->i_mapping;
struct xfs_inode *ip = XFS_I(inode);
struct xfs_mount *mp = ip->i_mount;
xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset);
xfs_fileoff_t end_fsb = xfs_iomap_end_fsb(mp, offset, count);
+ xfs_fileoff_t cow_fsb = NULLFILEOFF;
struct xfs_bmbt_irec imap, cmap;
struct xfs_iext_cursor icur, ccur;
xfs_fsblock_t prealloc_blocks = 0;
@@ -1775,6 +1777,8 @@ xfs_buffered_write_iomap_begin(
}
cow_eof = !xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb,
&ccur, &cmap);
+ if (!cow_eof)
+ cow_fsb = cmap.br_startoff;
}
/* We never need to allocate blocks for unsharing a hole. */
@@ -1789,17 +1793,37 @@ xfs_buffered_write_iomap_begin(
* writeback to remap pending blocks and restart the lookup.
*/
if ((flags & IOMAP_ZERO) && imap.br_startoff > offset_fsb) {
- if (filemap_range_needs_writeback(inode->i_mapping, offset,
- offset + count - 1)) {
+ loff_t start, end;
+
+ imap.br_blockcount = imap.br_startoff - offset_fsb;
+ imap.br_startoff = offset_fsb;
+ imap.br_startblock = HOLESTARTBLOCK;
+ imap.br_state = XFS_EXT_NORM;
+
+ if (cow_fsb == NULLFILEOFF) {
+ goto found_imap;
+ } else if (cow_fsb > offset_fsb) {
+ xfs_trim_extent(&imap, offset_fsb,
+ cow_fsb - offset_fsb);
+ goto found_imap;
+ }
+
+ /* COW fork blocks overlap the hole */
+ xfs_trim_extent(&imap, offset_fsb,
+ cmap.br_startoff + cmap.br_blockcount - offset_fsb);
+ start = XFS_FSB_TO_B(mp, imap.br_startoff);
+ end = XFS_FSB_TO_B(mp,
+ imap.br_startoff + imap.br_blockcount) - 1;
+ if (filemap_range_needs_writeback(mapping, start, end)) {
xfs_iunlock(ip, lockmode);
- error = filemap_write_and_wait_range(inode->i_mapping,
- offset, offset + count - 1);
+ error = filemap_write_and_wait_range(mapping, start,
+ end);
if (error)
return error;
goto restart;
}
- xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
- goto out_unlock;
+
+ goto found_imap;
}
/*
--
2.51.0
* [PATCH 6/6] xfs: replace zero range flush with folio batch
2025-10-16 19:02 [PATCH 0/6] iomap, xfs: improve zero range flushing and lookup Brian Foster
` (4 preceding siblings ...)
2025-10-16 19:03 ` [PATCH 5/6] xfs: only flush when COW fork blocks overlap data fork holes Brian Foster
@ 2025-10-16 19:03 ` Brian Foster
2025-11-05 22:37 ` Darrick J. Wong
5 siblings, 1 reply; 27+ messages in thread
From: Brian Foster @ 2025-10-16 19:03 UTC (permalink / raw)
To: linux-fsdevel, linux-xfs
Now that the zero range pagecache flush is purely isolated to
providing zeroing correctness in this case, we can remove it and
replace it with the folio batch mechanism that is used for handling
unwritten extents.
This is still slightly odd in that XFS reports a hole vs. a mapping
that reflects the COW fork extents, but that has always been the case
in this situation and so is a separate issue. We drop the iomap
warning that assumes the folio batch is always associated with
unwritten mappings, but this is mainly a development assertion; the
core iomap fbatch code otherwise doesn't care much about the mapping
type once it's handed the set of folios to process.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/iomap/buffered-io.c | 4 ----
fs/xfs/xfs_iomap.c | 16 ++++------------
2 files changed, 4 insertions(+), 16 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d6de689374c3..7bc4b8d090ee 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1534,10 +1534,6 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
while ((ret = iomap_iter(&iter, ops)) > 0) {
const struct iomap *srcmap = iomap_iter_srcmap(&iter);
- if (WARN_ON_ONCE((iter.iomap.flags & IOMAP_F_FOLIO_BATCH) &&
- srcmap->type != IOMAP_UNWRITTEN))
- return -EIO;
-
if (!(iter.iomap.flags & IOMAP_F_FOLIO_BATCH) &&
(srcmap->type == IOMAP_HOLE ||
srcmap->type == IOMAP_UNWRITTEN)) {
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 29f1462819fa..5a845a0ded79 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1704,7 +1704,6 @@ xfs_buffered_write_iomap_begin(
{
struct iomap_iter *iter = container_of(iomap, struct iomap_iter,
iomap);
- struct address_space *mapping = inode->i_mapping;
struct xfs_inode *ip = XFS_I(inode);
struct xfs_mount *mp = ip->i_mount;
xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset);
@@ -1736,7 +1735,6 @@ xfs_buffered_write_iomap_begin(
if (error)
return error;
-restart:
error = xfs_ilock_for_iomap(ip, flags, &lockmode);
if (error)
return error;
@@ -1812,16 +1810,10 @@ xfs_buffered_write_iomap_begin(
xfs_trim_extent(&imap, offset_fsb,
cmap.br_startoff + cmap.br_blockcount - offset_fsb);
start = XFS_FSB_TO_B(mp, imap.br_startoff);
- end = XFS_FSB_TO_B(mp,
- imap.br_startoff + imap.br_blockcount) - 1;
- if (filemap_range_needs_writeback(mapping, start, end)) {
- xfs_iunlock(ip, lockmode);
- error = filemap_write_and_wait_range(mapping, start,
- end);
- if (error)
- return error;
- goto restart;
- }
+ end = XFS_FSB_TO_B(mp, imap.br_startoff + imap.br_blockcount);
+ iomap_flags |= iomap_fill_dirty_folios(iter, &start, end);
+ xfs_trim_extent(&imap, offset_fsb,
+ XFS_B_TO_FSB(mp, start) - offset_fsb);
goto found_imap;
}
--
2.51.0
* Re: [PATCH 1/6] iomap: replace folio_batch allocation with stack allocation
2025-10-16 19:02 ` [PATCH 1/6] iomap: replace folio_batch allocation with stack allocation Brian Foster
@ 2025-11-05 0:07 ` Darrick J. Wong
2025-11-05 15:27 ` Brian Foster
0 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:07 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-fsdevel, linux-xfs
On Thu, Oct 16, 2025 at 03:02:58PM -0400, Brian Foster wrote:
> Zhang Yi points out that the dynamic folio_batch allocation in
> iomap_fill_dirty_folios() is problematic for the ext4 on iomap work
> that is under development because it doesn't sufficiently handle the
> allocation failure case (by allowing a retry, for example).
>
> The dynamic allocation was initially added for simplicity and to
> help indicate whether the batch was used or not by the calling fs.
> To address this issue, put the batch on the stack of
> iomap_zero_range() and use a flag to control whether the batch
> should be used in the iomap folio lookup path. This keeps things
> simple and eliminates the concern for ext4 on iomap.
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
Hrmm, so who kmallocs the fbatch array now? Is that left as an exercise
to the filesystem? I'm confused because I don't see the kmalloc call
reappear elsewhere, at least not in this patch.
--D
> ---
> fs/iomap/buffered-io.c | 45 ++++++++++++++++++++++++++++--------------
> fs/iomap/iter.c | 6 +++---
> fs/xfs/xfs_iomap.c | 11 ++++++-----
> include/linux/iomap.h | 8 ++++++--
> 4 files changed, 45 insertions(+), 25 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 51ecb6d48feb..05ff82c5432e 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -761,7 +761,7 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter,
> if (!mapping_large_folio_support(iter->inode->i_mapping))
> len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
>
> - if (iter->fbatch) {
> + if (iter->iomap.flags & IOMAP_F_FOLIO_BATCH) {
> struct folio *folio = folio_batch_next(iter->fbatch);
>
> if (!folio)
> @@ -858,7 +858,7 @@ static int iomap_write_begin(struct iomap_iter *iter,
> * process so return and let the caller iterate and refill the batch.
> */
> if (!folio) {
> - WARN_ON_ONCE(!iter->fbatch);
> + WARN_ON_ONCE(!(iter->iomap.flags & IOMAP_F_FOLIO_BATCH));
> return 0;
> }
>
> @@ -1473,23 +1473,34 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> return status;
> }
>
> -loff_t
> +/**
> + * iomap_fill_dirty_folios - fill a folio batch with dirty folios
> + * @iter: Iteration structure
> + * @start: Start offset of range. Updated based on lookup progress.
> + * @end: End offset of range
> + *
> + * Returns the associated control flag if the folio batch is available and the
> + * lookup performed. The caller is responsible to set the flag on the associated
> + * iomap.
> + */
> +unsigned int
> iomap_fill_dirty_folios(
> struct iomap_iter *iter,
> - loff_t offset,
> - loff_t length)
> + loff_t *start,
> + loff_t end)
> {
> struct address_space *mapping = iter->inode->i_mapping;
> - pgoff_t start = offset >> PAGE_SHIFT;
> - pgoff_t end = (offset + length - 1) >> PAGE_SHIFT;
> + pgoff_t pstart = *start >> PAGE_SHIFT;
> + pgoff_t pend = (end - 1) >> PAGE_SHIFT;
>
> - iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
> - if (!iter->fbatch)
> - return offset + length;
> - folio_batch_init(iter->fbatch);
> + if (!iter->fbatch) {
> + *start = end;
> + return 0;
> + }
>
> - filemap_get_folios_dirty(mapping, &start, end, iter->fbatch);
> - return (start << PAGE_SHIFT);
> + filemap_get_folios_dirty(mapping, &pstart, pend, iter->fbatch);
> + *start = (pstart << PAGE_SHIFT);
> + return IOMAP_F_FOLIO_BATCH;
> }
> EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios);
>
> @@ -1498,17 +1509,21 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> const struct iomap_ops *ops,
> const struct iomap_write_ops *write_ops, void *private)
> {
> + struct folio_batch fbatch;
> struct iomap_iter iter = {
> .inode = inode,
> .pos = pos,
> .len = len,
> .flags = IOMAP_ZERO,
> .private = private,
> + .fbatch = &fbatch,
> };
> struct address_space *mapping = inode->i_mapping;
> int ret;
> bool range_dirty;
>
> + folio_batch_init(&fbatch);
> +
> /*
> * To avoid an unconditional flush, check pagecache state and only flush
> * if dirty and the fs returns a mapping that might convert on
> @@ -1519,11 +1534,11 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> while ((ret = iomap_iter(&iter, ops)) > 0) {
> const struct iomap *srcmap = iomap_iter_srcmap(&iter);
>
> - if (WARN_ON_ONCE(iter.fbatch &&
> + if (WARN_ON_ONCE((iter.iomap.flags & IOMAP_F_FOLIO_BATCH) &&
> srcmap->type != IOMAP_UNWRITTEN))
> return -EIO;
>
> - if (!iter.fbatch &&
> + if (!(iter.iomap.flags & IOMAP_F_FOLIO_BATCH) &&
> (srcmap->type == IOMAP_HOLE ||
> srcmap->type == IOMAP_UNWRITTEN)) {
> s64 status;
> diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
> index 66ca12aac57d..026d85823c76 100644
> --- a/fs/iomap/iter.c
> +++ b/fs/iomap/iter.c
> @@ -8,10 +8,10 @@
>
> static inline void iomap_iter_reset_iomap(struct iomap_iter *iter)
> {
> - if (iter->fbatch) {
> + if (iter->iomap.flags & IOMAP_F_FOLIO_BATCH) {
> folio_batch_release(iter->fbatch);
> - kfree(iter->fbatch);
> - iter->fbatch = NULL;
> + folio_batch_reinit(iter->fbatch);
> + iter->iomap.flags &= ~IOMAP_F_FOLIO_BATCH;
> }
>
> iter->status = 0;
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 535bf3b8705d..01833aca37ac 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1775,7 +1775,6 @@ xfs_buffered_write_iomap_begin(
> */
> if (flags & IOMAP_ZERO) {
> xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
> - u64 end;
>
> if (isnullstartblock(imap.br_startblock) &&
> offset_fsb >= eof_fsb)
> @@ -1795,12 +1794,14 @@ xfs_buffered_write_iomap_begin(
> */
> if (imap.br_state == XFS_EXT_UNWRITTEN &&
> offset_fsb < eof_fsb) {
> - loff_t len = min(count,
> - XFS_FSB_TO_B(mp, imap.br_blockcount));
> + loff_t foffset = offset, fend;
>
> - end = iomap_fill_dirty_folios(iter, offset, len);
> + fend = offset +
> + min(count, XFS_FSB_TO_B(mp, imap.br_blockcount));
> + iomap_flags |= iomap_fill_dirty_folios(iter, &foffset,
> + fend);
> end_fsb = min_t(xfs_fileoff_t, end_fsb,
> - XFS_B_TO_FSB(mp, end));
> + XFS_B_TO_FSB(mp, foffset));
> }
>
> xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index cd0f573156d6..79da917ff45e 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -87,6 +87,9 @@ struct vm_fault;
> /*
> * Flags set by the core iomap code during operations:
> *
> + * IOMAP_F_FOLIO_BATCH indicates that the folio batch mechanism is active
> + * for this operation, set by iomap_fill_dirty_folios().
> + *
> * IOMAP_F_SIZE_CHANGED indicates to the iomap_end method that the file size
> * has changed as the result of this write operation.
> *
> @@ -94,6 +97,7 @@ struct vm_fault;
> * range it covers needs to be remapped by the high level before the operation
> * can proceed.
> */
> +#define IOMAP_F_FOLIO_BATCH (1U << 13)
> #define IOMAP_F_SIZE_CHANGED (1U << 14)
> #define IOMAP_F_STALE (1U << 15)
>
> @@ -351,8 +355,8 @@ bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
> int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> const struct iomap_ops *ops,
> const struct iomap_write_ops *write_ops);
> -loff_t iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t offset,
> - loff_t length);
> +unsigned int iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t *start,
> + loff_t end);
> int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
> bool *did_zero, const struct iomap_ops *ops,
> const struct iomap_write_ops *write_ops, void *private);
> --
> 2.51.0
>
>
* Re: [PATCH 3/6] xfs: flush eof folio before insert range size update
2025-10-16 19:03 ` [PATCH 3/6] xfs: flush eof folio before insert range size update Brian Foster
@ 2025-11-05 0:14 ` Darrick J. Wong
2025-11-05 15:34 ` Brian Foster
0 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:14 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-fsdevel, linux-xfs
On Thu, Oct 16, 2025 at 03:03:00PM -0400, Brian Foster wrote:
> The flush in xfs_buffered_write_iomap_begin() for zero range over a
> data fork hole fronted by COW fork prealloc is primarily designed to
> provide correct zeroing behavior in particular pagecache conditions.
> As it turns out, this also partially masks some odd behavior in
> insert range (via zero range via setattr).
>
> Insert range bumps i_size the length of the new range, flushes,
> unmaps pagecache and cancels COW prealloc, and then right shifts
> extents from the end of the file back to the target offset of the
> insert. Since the i_size update occurs before the pagecache flush,
> this creates a transient situation where writeback around EOF can
> behave differently.
Why not flush the file from @offset to EOF, flush the COW
preallocations, extend i_size, and only then start shifting extents?
That would seem a lot more straightforward to me.
--D
> This appears to be corner case situation, but if happens to be
> fronted by COW fork speculative preallocation and a large, dirty
> folio that contains at least one full COW block beyond EOF, the
> writeback after i_size is bumped may remap that COW fork block into
> the data fork within EOF. The block is zeroed and then shifted back
> out to post-eof, but this is unexpected in that it leads to a
> written post-eof data fork block. This can cause a zero range
> warning on a subsequent size extension, because we should never find
> blocks that require physical zeroing beyond i_size.
>
> To avoid this quirk, flush the EOF folio before the i_size update
> during insert range. The entire range will be flushed, unmapped and
> invalidated anyways, so this should be relatively unnoticeable.
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> fs/xfs/xfs_file.c | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 5b9864c8582e..cc3a9674ad40 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1226,6 +1226,23 @@ xfs_falloc_insert_range(
> if (offset >= isize)
> return -EINVAL;
>
> + /*
> + * Let writeback clean up EOF folio state before we bump i_size. The
> + * insert flushes before it starts shifting and under certain
> + * circumstances we can write back blocks that should technically be
> + * considered post-eof (and thus should not be submitted for writeback).
> + *
> + * For example, a large, dirty folio that spans EOF and is backed by
> + * post-eof COW fork preallocation can cause block remap into the data
> + * fork. This shifts back out beyond EOF, but creates an unexpectedly
> + * written post-eof block. The insert is going to flush, unmap and
> + * cancel prealloc across this whole range, so flush EOF now before we
> + * bump i_size to provide consistent behavior.
> + */
> + error = filemap_write_and_wait_range(inode->i_mapping, isize, isize);
> + if (error)
> + return error;
> +
> error = xfs_falloc_setsize(file, isize + len);
> if (error)
> return error;
> --
> 2.51.0
>
>
* Re: [PATCH 2/6] iomap, xfs: lift zero range hole mapping flush into xfs
2025-10-16 19:02 ` [PATCH 2/6] iomap, xfs: lift zero range hole mapping flush into xfs Brian Foster
@ 2025-11-05 0:31 ` Darrick J. Wong
2025-11-05 15:33 ` Brian Foster
0 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:31 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-fsdevel, linux-xfs
On Thu, Oct 16, 2025 at 03:02:59PM -0400, Brian Foster wrote:
> iomap zero range has a wart in that it also flushes dirty pagecache
> over hole mappings (rather than only unwritten mappings). This was
> included to accommodate a quirk in XFS where COW fork preallocation
> can exist over a hole in the data fork, and the associated range is
> reported as a hole. This is because the range actually is a hole,
> but XFS also has an optimization where if COW fork blocks exist for
> a range being written to, those blocks are used regardless of
> whether the data fork blocks are shared or not. For zeroing, COW
> fork blocks over a data fork hole are only relevant if the range is
> dirty in pagecache, otherwise the range is already considered
> zeroed.
It occurs to me that the situation (unwritten cow mapping, hole in data
fork) results in iomap_iter::iomap getting the unwritten mapping, and
iomap_iter::srcmap getting the hole mapping. iomap_iter_srcmap returns
iomap_iter::iomap because srcmap.type == HOLE.
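For reference, the selection helper in include/linux/iomap.h is roughly
the following (quoted from memory, so check the tree):

  static inline const struct iomap *iomap_iter_srcmap(const struct iomap_iter *i)
  {
          if (i->srcmap.type != IOMAP_HOLE)
                  return &i->srcmap;
          return &i->iomap;
  }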
But then you have ext4 where there is no cow fork, so it will only ever
set iomap_iter::iomap, leaving iomap_iter::srcmap set to the default.
The default srcmap is a HOLE.
So iomap can't distinguish between xfs' speculative cow over a hole
behavior vs. ext4 just being simple. I wonder if we actually need to
introduce a new iomap type for "pure overwrite"?
The reason I say that is that in designing the fuse-iomap uapi, it was a
lot easier to understand the programming model if there were always
explicit read and write mappings being sent back and forth; and a new
type FUSE_IOMAP_TYPE_PURE_OVERWRITE that could be stored in the write
mapping to mean "just look at the read mapping". If such a beast were
ported to the core iomap code then maybe that would help here?
A hole with an out-of-place mapping needs a flush (or maybe just go find
the pagecache and zero it), whereas a hole with nothing else backing it
clearly doesn't need any action at all.
Does that help?
--D
> The easiest way to deal with this corner case is to flush the
> pagecache to trigger COW remapping into the data fork, and then
> operate on the updated on-disk state. The problem is that ext4
> cannot accommodate a flush from this context due to being a
> transaction deadlock vector.
>
> Outside of the hole quirk, ext4 can avoid the flush for zero range
> by using the recently introduced folio batch lookup mechanism for
> unwritten mappings. Therefore, take the next logical step and lift
> the hole handling logic into the XFS iomap_begin handler. iomap will
> still flush on unwritten mappings without a folio batch, and XFS
> will flush and retry mapping lookups in the case where it would
> otherwise report a hole with dirty pagecache during a zero range.
>
> Note that this is intended to be a fairly straightforward lift and
> otherwise not change behavior. Now that the flush exists within XFS,
> follow on patches can further optimize it.
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> fs/iomap/buffered-io.c | 2 +-
> fs/xfs/xfs_iomap.c | 25 ++++++++++++++++++++++---
> 2 files changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 05ff82c5432e..d6de689374c3 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1543,7 +1543,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> srcmap->type == IOMAP_UNWRITTEN)) {
> s64 status;
>
> - if (range_dirty) {
> + if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) {
> range_dirty = false;
> status = iomap_zero_iter_flush_and_stale(&iter);
> } else {
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 01833aca37ac..b84c94558cc9 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1734,6 +1734,7 @@ xfs_buffered_write_iomap_begin(
> if (error)
> return error;
>
> +restart:
> error = xfs_ilock_for_iomap(ip, flags, &lockmode);
> if (error)
> return error;
> @@ -1761,9 +1762,27 @@ xfs_buffered_write_iomap_begin(
> if (eof)
> imap.br_startoff = end_fsb; /* fake hole until the end */
>
> - /* We never need to allocate blocks for zeroing or unsharing a hole. */
> - if ((flags & (IOMAP_UNSHARE | IOMAP_ZERO)) &&
> - imap.br_startoff > offset_fsb) {
> + /* We never need to allocate blocks for unsharing a hole. */
> + if ((flags & IOMAP_UNSHARE) && imap.br_startoff > offset_fsb) {
> + xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
> + goto out_unlock;
> + }
> +
> + /*
> + * We may need to zero over a hole in the data fork if it's fronted by
> + * COW blocks and dirty pagecache. To make sure zeroing occurs, force
> + * writeback to remap pending blocks and restart the lookup.
> + */
> + if ((flags & IOMAP_ZERO) && imap.br_startoff > offset_fsb) {
> + if (filemap_range_needs_writeback(inode->i_mapping, offset,
> + offset + count - 1)) {
> + xfs_iunlock(ip, lockmode);
> + error = filemap_write_and_wait_range(inode->i_mapping,
> + offset, offset + count - 1);
> + if (error)
> + return error;
> + goto restart;
> + }
> xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
> goto out_unlock;
> }
> --
> 2.51.0
>
>
* Re: [PATCH 1/6] iomap: replace folio_batch allocation with stack allocation
2025-11-05 0:07 ` Darrick J. Wong
@ 2025-11-05 15:27 ` Brian Foster
2025-11-05 21:41 ` Darrick J. Wong
0 siblings, 1 reply; 27+ messages in thread
From: Brian Foster @ 2025-11-05 15:27 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs
On Tue, Nov 04, 2025 at 04:07:16PM -0800, Darrick J. Wong wrote:
> On Thu, Oct 16, 2025 at 03:02:58PM -0400, Brian Foster wrote:
> > Zhang Yi points out that the dynamic folio_batch allocation in
> > iomap_fill_dirty_folios() is problematic for the ext4 on iomap work
> > that is under development because it doesn't sufficiently handle the
> > allocation failure case (by allowing a retry, for example).
> >
> > The dynamic allocation was initially added for simplicity and to
> > help indicate whether the batch was used or not by the calling fs.
> > To address this issue, put the batch on the stack of
> > iomap_zero_range() and use a flag to control whether the batch
> > should be used in the iomap folio lookup path. This keeps things
> > simple and eliminates the concern for ext4 on iomap.
> >
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
>
> Hrmm, so who kmallocs the fbatch array now? Is that left as an exercise
> to the filesystem? I'm confused because I don't see the kmalloc call
> reappear elsewhere, at least not in this patch.
>
It's no longer dynamically allocated. It's allocated on the stack in
iomap_zero_range(), and a flag is set on the iomap if the pagecache
lookup occurs. The allocation is a potential problem for the ext4 on
iomap port, so this elides the need for dealing with alloc failures and
whatnot.
This is probably how it should have been done from the start, but when
the flag related feedback came along it was deep enough in test/review
cycle that I preferred to do it separately.
Brian
> --D
>
> > ---
> > fs/iomap/buffered-io.c | 45 ++++++++++++++++++++++++++++--------------
> > fs/iomap/iter.c | 6 +++---
> > fs/xfs/xfs_iomap.c | 11 ++++++-----
> > include/linux/iomap.h | 8 ++++++--
> > 4 files changed, 45 insertions(+), 25 deletions(-)
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 51ecb6d48feb..05ff82c5432e 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -761,7 +761,7 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter,
> > if (!mapping_large_folio_support(iter->inode->i_mapping))
> > len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
> >
> > - if (iter->fbatch) {
> > + if (iter->iomap.flags & IOMAP_F_FOLIO_BATCH) {
> > struct folio *folio = folio_batch_next(iter->fbatch);
> >
> > if (!folio)
> > @@ -858,7 +858,7 @@ static int iomap_write_begin(struct iomap_iter *iter,
> > * process so return and let the caller iterate and refill the batch.
> > */
> > if (!folio) {
> > - WARN_ON_ONCE(!iter->fbatch);
> > + WARN_ON_ONCE(!(iter->iomap.flags & IOMAP_F_FOLIO_BATCH));
> > return 0;
> > }
> >
> > @@ -1473,23 +1473,34 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> > return status;
> > }
> >
> > -loff_t
> > +/**
> > + * iomap_fill_dirty_folios - fill a folio batch with dirty folios
> > + * @iter: Iteration structure
> > + * @start: Start offset of range. Updated based on lookup progress.
> > + * @end: End offset of range
> > + *
> > + * Returns the associated control flag if the folio batch is available and the
> > + * lookup performed. The caller is responsible to set the flag on the associated
> > + * iomap.
> > + */
> > +unsigned int
> > iomap_fill_dirty_folios(
> > struct iomap_iter *iter,
> > - loff_t offset,
> > - loff_t length)
> > + loff_t *start,
> > + loff_t end)
> > {
> > struct address_space *mapping = iter->inode->i_mapping;
> > - pgoff_t start = offset >> PAGE_SHIFT;
> > - pgoff_t end = (offset + length - 1) >> PAGE_SHIFT;
> > + pgoff_t pstart = *start >> PAGE_SHIFT;
> > + pgoff_t pend = (end - 1) >> PAGE_SHIFT;
> >
> > - iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
> > - if (!iter->fbatch)
> > - return offset + length;
> > - folio_batch_init(iter->fbatch);
> > + if (!iter->fbatch) {
> > + *start = end;
> > + return 0;
> > + }
> >
> > - filemap_get_folios_dirty(mapping, &start, end, iter->fbatch);
> > - return (start << PAGE_SHIFT);
> > + filemap_get_folios_dirty(mapping, &pstart, pend, iter->fbatch);
> > + *start = (pstart << PAGE_SHIFT);
> > + return IOMAP_F_FOLIO_BATCH;
> > }
> > EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios);
> >
> > @@ -1498,17 +1509,21 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > const struct iomap_ops *ops,
> > const struct iomap_write_ops *write_ops, void *private)
> > {
> > + struct folio_batch fbatch;
> > struct iomap_iter iter = {
> > .inode = inode,
> > .pos = pos,
> > .len = len,
> > .flags = IOMAP_ZERO,
> > .private = private,
> > + .fbatch = &fbatch,
> > };
> > struct address_space *mapping = inode->i_mapping;
> > int ret;
> > bool range_dirty;
> >
> > + folio_batch_init(&fbatch);
> > +
> > /*
> > * To avoid an unconditional flush, check pagecache state and only flush
> > * if dirty and the fs returns a mapping that might convert on
> > @@ -1519,11 +1534,11 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > while ((ret = iomap_iter(&iter, ops)) > 0) {
> > const struct iomap *srcmap = iomap_iter_srcmap(&iter);
> >
> > - if (WARN_ON_ONCE(iter.fbatch &&
> > + if (WARN_ON_ONCE((iter.iomap.flags & IOMAP_F_FOLIO_BATCH) &&
> > srcmap->type != IOMAP_UNWRITTEN))
> > return -EIO;
> >
> > - if (!iter.fbatch &&
> > + if (!(iter.iomap.flags & IOMAP_F_FOLIO_BATCH) &&
> > (srcmap->type == IOMAP_HOLE ||
> > srcmap->type == IOMAP_UNWRITTEN)) {
> > s64 status;
> > diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
> > index 66ca12aac57d..026d85823c76 100644
> > --- a/fs/iomap/iter.c
> > +++ b/fs/iomap/iter.c
> > @@ -8,10 +8,10 @@
> >
> > static inline void iomap_iter_reset_iomap(struct iomap_iter *iter)
> > {
> > - if (iter->fbatch) {
> > + if (iter->iomap.flags & IOMAP_F_FOLIO_BATCH) {
> > folio_batch_release(iter->fbatch);
> > - kfree(iter->fbatch);
> > - iter->fbatch = NULL;
> > + folio_batch_reinit(iter->fbatch);
> > + iter->iomap.flags &= ~IOMAP_F_FOLIO_BATCH;
> > }
> >
> > iter->status = 0;
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index 535bf3b8705d..01833aca37ac 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -1775,7 +1775,6 @@ xfs_buffered_write_iomap_begin(
> > */
> > if (flags & IOMAP_ZERO) {
> > xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
> > - u64 end;
> >
> > if (isnullstartblock(imap.br_startblock) &&
> > offset_fsb >= eof_fsb)
> > @@ -1795,12 +1794,14 @@ xfs_buffered_write_iomap_begin(
> > */
> > if (imap.br_state == XFS_EXT_UNWRITTEN &&
> > offset_fsb < eof_fsb) {
> > - loff_t len = min(count,
> > - XFS_FSB_TO_B(mp, imap.br_blockcount));
> > + loff_t foffset = offset, fend;
> >
> > - end = iomap_fill_dirty_folios(iter, offset, len);
> > + fend = offset +
> > + min(count, XFS_FSB_TO_B(mp, imap.br_blockcount));
> > + iomap_flags |= iomap_fill_dirty_folios(iter, &foffset,
> > + fend);
> > end_fsb = min_t(xfs_fileoff_t, end_fsb,
> > - XFS_B_TO_FSB(mp, end));
> > + XFS_B_TO_FSB(mp, foffset));
> > }
> >
> > xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index cd0f573156d6..79da917ff45e 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -87,6 +87,9 @@ struct vm_fault;
> > /*
> > * Flags set by the core iomap code during operations:
> > *
> > + * IOMAP_F_FOLIO_BATCH indicates that the folio batch mechanism is active
> > + * for this operation, set by iomap_fill_dirty_folios().
> > + *
> > * IOMAP_F_SIZE_CHANGED indicates to the iomap_end method that the file size
> > * has changed as the result of this write operation.
> > *
> > @@ -94,6 +97,7 @@ struct vm_fault;
> > * range it covers needs to be remapped by the high level before the operation
> > * can proceed.
> > */
> > +#define IOMAP_F_FOLIO_BATCH (1U << 13)
> > #define IOMAP_F_SIZE_CHANGED (1U << 14)
> > #define IOMAP_F_STALE (1U << 15)
> >
> > @@ -351,8 +355,8 @@ bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
> > int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> > const struct iomap_ops *ops,
> > const struct iomap_write_ops *write_ops);
> > -loff_t iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t offset,
> > - loff_t length);
> > +unsigned int iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t *start,
> > + loff_t end);
> > int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
> > bool *did_zero, const struct iomap_ops *ops,
> > const struct iomap_write_ops *write_ops, void *private);
> > --
> > 2.51.0
> >
> >
>
* Re: [PATCH 2/6] iomap, xfs: lift zero range hole mapping flush into xfs
2025-11-05 0:31 ` Darrick J. Wong
@ 2025-11-05 15:33 ` Brian Foster
2025-11-05 22:23 ` Darrick J. Wong
0 siblings, 1 reply; 27+ messages in thread
From: Brian Foster @ 2025-11-05 15:33 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs
On Tue, Nov 04, 2025 at 04:31:14PM -0800, Darrick J. Wong wrote:
> On Thu, Oct 16, 2025 at 03:02:59PM -0400, Brian Foster wrote:
> > iomap zero range has a wart in that it also flushes dirty pagecache
> > over hole mappings (rather than only unwritten mappings). This was
> > included to accommodate a quirk in XFS where COW fork preallocation
> > can exist over a hole in the data fork, and the associated range is
> > reported as a hole. This is because the range actually is a hole,
> > but XFS also has an optimization where if COW fork blocks exist for
> > a range being written to, those blocks are used regardless of
> > whether the data fork blocks are shared or not. For zeroing, COW
> > fork blocks over a data fork hole are only relevant if the range is
> > dirty in pagecache, otherwise the range is already considered
> > zeroed.
>
> It occurs to me that the situation (unwritten cow mapping, hole in data
> fork) results in iomap_iter::iomap getting the unwritten mapping, and
> iomap_iter::srcmap getting the hole mapping. iomap_iter_srcmap returns
> iomap_iter::iomap because srcmap.type == HOLE.
>
> But then you have ext4 where there is no cow fork, so it will only ever
> set iomap_iter::iomap, leaving iomap_iter::srcmap set to the default.
> The default srcmap is a HOLE.
>
> So iomap can't distinguish between xfs' speculative cow over a hole
> behavior vs. ext4 just being simple. I wonder if we actually need to
> introduce a new iomap type for "pure overwrite"?
>
I definitely think we need a better solution here in iomap. The current
iomap/srcmap management/handling is quite confusing. What that solution
is, I'm not sure.
> The reason I say that that in designing the fuse-iomap uapi, it was a
> lot easier to understand the programming model if there was always
> explicit read and write mappings being sent back and forth; and a new
> type FUSE_IOMAP_TYPE_PURE_OVERWRITE that could be stored in the write
> mapping to mean "just look at the read mapping". If such a beast were
> ported to the core iomap code then maybe that would help here?
>
I'm not following what this means. Separate read/write mappings for each
individual iomap operation (i.e. "read from here, write to there"), or
separate iomap structures to be used for read ops vs. write ops, or
something else..?
> A hole with an out-of-place mapping needs a flush (or maybe just go find
> the pagecache and zero it), whereas a hole with nothing else backing it
> clearly doesn't need any action at all.
>
> Does that help?
>
This kind of sounds like what we're already doing in iomap, so I suspect
I'm missing something on the fuse side...
WRT this patchset, I'm trying to address the underlying problems that
require the flush-a-dirty-hole hack that provides zeroing correctness
for XFS. This needs to be lifted out of iomap because it also causes
problems for ext4, but can be bypassed completely for XFS as well by the
end of the series. The first part is just a straight lift into xfs, but
the next patches replace the flush with use of the folio batch, and
split off the band-aid case down into insert range where this flush was
also indirectly suppressing issues.
For the former, if you look at the last few patches the main reason we
rely on the flush-a-dirty-hole hack is that we don't actually report the
mappings correctly in XFS for zero range with respect to allowable
behavior. We just report a hole if one exists in the data fork. So this
is trying to encode that "COW fork prealloc over data fork hole"
scenario correctly for zero range, and also identify when we need to
consider whether the mapping range is dirty in cache (i.e. unwritten COW
blocks).
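Roughly, the shape this takes by the end of the series is something like
the following (sketch only, not the literal diff; names as in the
existing iomap_begin code):
	if ((flags & IOMAP_ZERO) && imap.br_startoff > offset_fsb) {
		if (cow_eof || cmap.br_startoff > offset_fsb) {
			/* plain hole, nothing to zero */
			xfs_hole_to_iomap(ip, iomap, offset_fsb,
					  imap.br_startoff);
			goto out_unlock;
		}
		/*
		 * COW fork prealloc fronts the data fork hole; only dirty
		 * pagecache over it still needs zeroing, which the folio
		 * batch lookup takes care of.
		 */
	}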
So yes in general I think we need to improve on iomap reporting somehow,
but I don't necessarily see how that avoids the need (or desire) to fix
up the iomap_begin logic. I also think it's confusing enough that it
should probably be a separate discussion (I'd probably need to stare at
the fuse-related proposition to grok it).
Ultimately the flush in zero range should go away completely except for
the default/fallback case where the fs supports zero range, fails to
check pagecache itself, and iomap has otherwise detected that the range
over an unwritten mapping was dirty. There has been some discussion over
potentially lifting the batch lookup into iomap as well, but there are
some details that would need to be worked out to determine whether that
can be done safely.
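To be concrete, the fallback flush that remains after this series boils
down to roughly the following (sketch, condensing the patch 1 and 2
hunks):
	/* fs did not fill the folio batch; flush only if actually dirty */
	if (!(iter.iomap.flags & IOMAP_F_FOLIO_BATCH) &&
	    srcmap->type == IOMAP_UNWRITTEN && range_dirty) {
		range_dirty = false;
		status = iomap_zero_iter_flush_and_stale(&iter);
	}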
Brian
> --D
>
> > The easiest way to deal with this corner case is to flush the
> > pagecache to trigger COW remapping into the data fork, and then
> > operate on the updated on-disk state. The problem is that ext4
> > cannot accommodate a flush from this context due to being a
> > transaction deadlock vector.
> >
> > Outside of the hole quirk, ext4 can avoid the flush for zero range
> > by using the recently introduced folio batch lookup mechanism for
> > unwritten mappings. Therefore, take the next logical step and lift
> > the hole handling logic into the XFS iomap_begin handler. iomap will
> > still flush on unwritten mappings without a folio batch, and XFS
> > will flush and retry mapping lookups in the case where it would
> > otherwise report a hole with dirty pagecache during a zero range.
> >
> > Note that this is intended to be a fairly straightforward lift and
> > otherwise not change behavior. Now that the flush exists within XFS,
> > follow on patches can further optimize it.
> >
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > fs/iomap/buffered-io.c | 2 +-
> > fs/xfs/xfs_iomap.c | 25 ++++++++++++++++++++++---
> > 2 files changed, 23 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 05ff82c5432e..d6de689374c3 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -1543,7 +1543,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > srcmap->type == IOMAP_UNWRITTEN)) {
> > s64 status;
> >
> > - if (range_dirty) {
> > + if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) {
> > range_dirty = false;
> > status = iomap_zero_iter_flush_and_stale(&iter);
> > } else {
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index 01833aca37ac..b84c94558cc9 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -1734,6 +1734,7 @@ xfs_buffered_write_iomap_begin(
> > if (error)
> > return error;
> >
> > +restart:
> > error = xfs_ilock_for_iomap(ip, flags, &lockmode);
> > if (error)
> > return error;
> > @@ -1761,9 +1762,27 @@ xfs_buffered_write_iomap_begin(
> > if (eof)
> > imap.br_startoff = end_fsb; /* fake hole until the end */
> >
> > - /* We never need to allocate blocks for zeroing or unsharing a hole. */
> > - if ((flags & (IOMAP_UNSHARE | IOMAP_ZERO)) &&
> > - imap.br_startoff > offset_fsb) {
> > + /* We never need to allocate blocks for unsharing a hole. */
> > + if ((flags & IOMAP_UNSHARE) && imap.br_startoff > offset_fsb) {
> > + xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
> > + goto out_unlock;
> > + }
> > +
> > + /*
> > + * We may need to zero over a hole in the data fork if it's fronted by
> > + * COW blocks and dirty pagecache. To make sure zeroing occurs, force
> > + * writeback to remap pending blocks and restart the lookup.
> > + */
> > + if ((flags & IOMAP_ZERO) && imap.br_startoff > offset_fsb) {
> > + if (filemap_range_needs_writeback(inode->i_mapping, offset,
> > + offset + count - 1)) {
> > + xfs_iunlock(ip, lockmode);
> > + error = filemap_write_and_wait_range(inode->i_mapping,
> > + offset, offset + count - 1);
> > + if (error)
> > + return error;
> > + goto restart;
> > + }
> > xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
> > goto out_unlock;
> > }
> > --
> > 2.51.0
> >
> >
>
* Re: [PATCH 3/6] xfs: flush eof folio before insert range size update
2025-11-05 0:14 ` Darrick J. Wong
@ 2025-11-05 15:34 ` Brian Foster
2025-11-05 22:15 ` Darrick J. Wong
0 siblings, 1 reply; 27+ messages in thread
From: Brian Foster @ 2025-11-05 15:34 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs
On Tue, Nov 04, 2025 at 04:14:45PM -0800, Darrick J. Wong wrote:
> On Thu, Oct 16, 2025 at 03:03:00PM -0400, Brian Foster wrote:
> > The flush in xfs_buffered_write_iomap_begin() for zero range over a
> > data fork hole fronted by COW fork prealloc is primarily designed to
> > provide correct zeroing behavior in particular pagecache conditions.
> > As it turns out, this also partially masks some odd behavior in
> > insert range (via zero range via setattr).
> >
> > Insert range bumps i_size the length of the new range, flushes,
> > unmaps pagecache and cancels COW prealloc, and then right shifts
> > extents from the end of the file back to the target offset of the
> > insert. Since the i_size update occurs before the pagecache flush,
> > this creates a transient situation where writeback around EOF can
> > behave differently.
>
> Why not flush the file from @offset to EOF, flush the COW
> preallocations, extend i_size, and only then start shifting extents?
> That would seem a lot more straightforward to me.
>
I agree. I noted in the cover letter that I started with this approach
of reordering the existing sequence of operations, but the factoring
looked ugly enough that I stopped and wanted to solicit input.
The details of that fell out of my brain since I posted this,
unfortunately. I suspect it may have been related to layering or
something wrt the prepare_shift factoring, but I'll take another look in
that direction for v2 and once I've got some feedback on the rest of the
series.. Thanks.
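For reference, the ordering being suggested would look roughly like this
(sketch only; helper names approximate, prepare_shift/locking details
glossed over):
	/* flush/unmap and drop COW prealloc from offset to EOF first ... */
	error = xfs_flush_unmap_range(ip, offset, isize - offset);
	if (error)
		return error;
	error = xfs_reflink_cancel_cow_range(ip, offset, isize - offset, true);
	if (error)
		return error;
	/* ... then bump i_size and shift extents right */
	error = xfs_falloc_setsize(file, isize + len);
	if (error)
		return error;
	error = xfs_insert_file_space(ip, offset, len);
i.e. do the pagecache and COW fork cleanup against the old i_size, then
the size update and shift.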
Brian
> --D
>
> > This appears to be a corner case situation, but if it happens to be
> > fronted by COW fork speculative preallocation and a large, dirty
> > folio that contains at least one full COW block beyond EOF, the
> > writeback after i_size is bumped may remap that COW fork block into
> > the data fork within EOF. The block is zeroed and then shifted back
> > out to post-eof, but this is unexpected in that it leads to a
> > written post-eof data fork block. This can cause a zero range
> > warning on a subsequent size extension, because we should never find
> > blocks that require physical zeroing beyond i_size.
> >
> > To avoid this quirk, flush the EOF folio before the i_size update
> > during insert range. The entire range will be flushed, unmapped and
> > invalidated anyways, so this should be relatively unnoticeable.
> >
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > fs/xfs/xfs_file.c | 17 +++++++++++++++++
> > 1 file changed, 17 insertions(+)
> >
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 5b9864c8582e..cc3a9674ad40 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -1226,6 +1226,23 @@ xfs_falloc_insert_range(
> > if (offset >= isize)
> > return -EINVAL;
> >
> > + /*
> > + * Let writeback clean up EOF folio state before we bump i_size. The
> > + * insert flushes before it starts shifting and under certain
> > + * circumstances we can write back blocks that should technically be
> > + * considered post-eof (and thus should not be submitted for writeback).
> > + *
> > + * For example, a large, dirty folio that spans EOF and is backed by
> > + * post-eof COW fork preallocation can cause block remap into the data
> > fork. This shifts back out beyond EOF, but creates an unexpectedly
> > + * written post-eof block. The insert is going to flush, unmap and
> > + * cancel prealloc across this whole range, so flush EOF now before we
> > + * bump i_size to provide consistent behavior.
> > + */
> > + error = filemap_write_and_wait_range(inode->i_mapping, isize, isize);
> > + if (error)
> > + return error;
> > +
> > error = xfs_falloc_setsize(file, isize + len);
> > if (error)
> > return error;
> > --
> > 2.51.0
> >
> >
>
* Re: [PATCH 1/6] iomap: replace folio_batch allocation with stack allocation
2025-11-05 15:27 ` Brian Foster
@ 2025-11-05 21:41 ` Darrick J. Wong
2025-11-06 15:51 ` Brian Foster
0 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2025-11-05 21:41 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-fsdevel, linux-xfs
On Wed, Nov 05, 2025 at 10:27:52AM -0500, Brian Foster wrote:
> On Tue, Nov 04, 2025 at 04:07:16PM -0800, Darrick J. Wong wrote:
> > On Thu, Oct 16, 2025 at 03:02:58PM -0400, Brian Foster wrote:
> > > Zhang Yi points out that the dynamic folio_batch allocation in
> > > iomap_fill_dirty_folios() is problematic for the ext4 on iomap work
> > > that is under development because it doesn't sufficiently handle the
> > > allocation failure case (by allowing a retry, for example).
> > >
> > > The dynamic allocation was initially added for simplicity and to
> > > help indicate whether the batch was used or not by the calling fs.
> > > To address this issue, put the batch on the stack of
> > > iomap_zero_range() and use a flag to control whether the batch
> > > should be used in the iomap folio lookup path. This keeps things
> > > simple and eliminates the concern for ext4 on iomap.
> > >
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> >
> > Hrmm, so who kmallocs the fbatch array now? Is that left as an exercise
> > to the filesystem? I'm confused because I don't see the kmalloc call
> > reappear elsewhere, at least not in this patch.
> >
>
> It's no longer dynamically allocated. It's allocated on the stack in
> iomap_zero_range(), and a flag is set on the iomap if the pagecache
> lookup occurs. The allocation is a potential problem for the ext4 on
> iomap port, so this elides the need for dealing with alloc failures and
> whatnot.
Oh, silly me. I got mixed up thinking that the deleted kmalloc was
allocating folio_batch::folios whereas in reality the kmalloc was for
the entire struct folio_batch. So yes, it's allocated on the stack.
Sorry about the noise.
--D
> This is probably how it should have been done from the start, but when
> the flag related feedback came along it was deep enough in test/review
> cycle that I preferred to do it separately.
>
> Brian
>
> > --D
> >
> > > ---
> > > fs/iomap/buffered-io.c | 45 ++++++++++++++++++++++++++++--------------
> > > fs/iomap/iter.c | 6 +++---
> > > fs/xfs/xfs_iomap.c | 11 ++++++-----
> > > include/linux/iomap.h | 8 ++++++--
> > > 4 files changed, 45 insertions(+), 25 deletions(-)
> > >
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index 51ecb6d48feb..05ff82c5432e 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -761,7 +761,7 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter,
> > > if (!mapping_large_folio_support(iter->inode->i_mapping))
> > > len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
> > >
> > > - if (iter->fbatch) {
> > > + if (iter->iomap.flags & IOMAP_F_FOLIO_BATCH) {
> > > struct folio *folio = folio_batch_next(iter->fbatch);
> > >
> > > if (!folio)
> > > @@ -858,7 +858,7 @@ static int iomap_write_begin(struct iomap_iter *iter,
> > > * process so return and let the caller iterate and refill the batch.
> > > */
> > > if (!folio) {
> > > - WARN_ON_ONCE(!iter->fbatch);
> > > + WARN_ON_ONCE(!(iter->iomap.flags & IOMAP_F_FOLIO_BATCH));
> > > return 0;
> > > }
> > >
> > > @@ -1473,23 +1473,34 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> > > return status;
> > > }
> > >
> > > -loff_t
> > > +/**
> > > + * iomap_fill_dirty_folios - fill a folio batch with dirty folios
> > > + * @iter: Iteration structure
> > > + * @start: Start offset of range. Updated based on lookup progress.
> > > + * @end: End offset of range
> > > + *
> > > + * Returns the associated control flag if the folio batch is available and the
> > > + * lookup performed. The caller is responsible to set the flag on the associated
> > > + * iomap.
> > > + */
> > > +unsigned int
> > > iomap_fill_dirty_folios(
> > > struct iomap_iter *iter,
> > > - loff_t offset,
> > > - loff_t length)
> > > + loff_t *start,
> > > + loff_t end)
> > > {
> > > struct address_space *mapping = iter->inode->i_mapping;
> > > - pgoff_t start = offset >> PAGE_SHIFT;
> > > - pgoff_t end = (offset + length - 1) >> PAGE_SHIFT;
> > > + pgoff_t pstart = *start >> PAGE_SHIFT;
> > > + pgoff_t pend = (end - 1) >> PAGE_SHIFT;
> > >
> > > - iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
> > > - if (!iter->fbatch)
> > > - return offset + length;
> > > - folio_batch_init(iter->fbatch);
> > > + if (!iter->fbatch) {
> > > + *start = end;
> > > + return 0;
> > > + }
> > >
> > > - filemap_get_folios_dirty(mapping, &start, end, iter->fbatch);
> > > - return (start << PAGE_SHIFT);
> > > + filemap_get_folios_dirty(mapping, &pstart, pend, iter->fbatch);
> > > + *start = (pstart << PAGE_SHIFT);
> > > + return IOMAP_F_FOLIO_BATCH;
> > > }
> > > EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios);
> > >
> > > @@ -1498,17 +1509,21 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > > const struct iomap_ops *ops,
> > > const struct iomap_write_ops *write_ops, void *private)
> > > {
> > > + struct folio_batch fbatch;
> > > struct iomap_iter iter = {
> > > .inode = inode,
> > > .pos = pos,
> > > .len = len,
> > > .flags = IOMAP_ZERO,
> > > .private = private,
> > > + .fbatch = &fbatch,
> > > };
> > > struct address_space *mapping = inode->i_mapping;
> > > int ret;
> > > bool range_dirty;
> > >
> > > + folio_batch_init(&fbatch);
> > > +
> > > /*
> > > * To avoid an unconditional flush, check pagecache state and only flush
> > > * if dirty and the fs returns a mapping that might convert on
> > > @@ -1519,11 +1534,11 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > > while ((ret = iomap_iter(&iter, ops)) > 0) {
> > > const struct iomap *srcmap = iomap_iter_srcmap(&iter);
> > >
> > > - if (WARN_ON_ONCE(iter.fbatch &&
> > > + if (WARN_ON_ONCE((iter.iomap.flags & IOMAP_F_FOLIO_BATCH) &&
> > > srcmap->type != IOMAP_UNWRITTEN))
> > > return -EIO;
> > >
> > > - if (!iter.fbatch &&
> > > + if (!(iter.iomap.flags & IOMAP_F_FOLIO_BATCH) &&
> > > (srcmap->type == IOMAP_HOLE ||
> > > srcmap->type == IOMAP_UNWRITTEN)) {
> > > s64 status;
> > > diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
> > > index 66ca12aac57d..026d85823c76 100644
> > > --- a/fs/iomap/iter.c
> > > +++ b/fs/iomap/iter.c
> > > @@ -8,10 +8,10 @@
> > >
> > > static inline void iomap_iter_reset_iomap(struct iomap_iter *iter)
> > > {
> > > - if (iter->fbatch) {
> > > + if (iter->iomap.flags & IOMAP_F_FOLIO_BATCH) {
> > > folio_batch_release(iter->fbatch);
> > > - kfree(iter->fbatch);
> > > - iter->fbatch = NULL;
> > > + folio_batch_reinit(iter->fbatch);
> > > + iter->iomap.flags &= ~IOMAP_F_FOLIO_BATCH;
> > > }
> > >
> > > iter->status = 0;
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index 535bf3b8705d..01833aca37ac 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -1775,7 +1775,6 @@ xfs_buffered_write_iomap_begin(
> > > */
> > > if (flags & IOMAP_ZERO) {
> > > xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
> > > - u64 end;
> > >
> > > if (isnullstartblock(imap.br_startblock) &&
> > > offset_fsb >= eof_fsb)
> > > @@ -1795,12 +1794,14 @@ xfs_buffered_write_iomap_begin(
> > > */
> > > if (imap.br_state == XFS_EXT_UNWRITTEN &&
> > > offset_fsb < eof_fsb) {
> > > - loff_t len = min(count,
> > > - XFS_FSB_TO_B(mp, imap.br_blockcount));
> > > + loff_t foffset = offset, fend;
> > >
> > > - end = iomap_fill_dirty_folios(iter, offset, len);
> > > + fend = offset +
> > > + min(count, XFS_FSB_TO_B(mp, imap.br_blockcount));
> > > + iomap_flags |= iomap_fill_dirty_folios(iter, &foffset,
> > > + fend);
> > > end_fsb = min_t(xfs_fileoff_t, end_fsb,
> > > - XFS_B_TO_FSB(mp, end));
> > > + XFS_B_TO_FSB(mp, foffset));
> > > }
> > >
> > > xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
> > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > > index cd0f573156d6..79da917ff45e 100644
> > > --- a/include/linux/iomap.h
> > > +++ b/include/linux/iomap.h
> > > @@ -87,6 +87,9 @@ struct vm_fault;
> > > /*
> > > * Flags set by the core iomap code during operations:
> > > *
> > > + * IOMAP_F_FOLIO_BATCH indicates that the folio batch mechanism is active
> > > + * for this operation, set by iomap_fill_dirty_folios().
> > > + *
> > > * IOMAP_F_SIZE_CHANGED indicates to the iomap_end method that the file size
> > > * has changed as the result of this write operation.
> > > *
> > > @@ -94,6 +97,7 @@ struct vm_fault;
> > > * range it covers needs to be remapped by the high level before the operation
> > > * can proceed.
> > > */
> > > +#define IOMAP_F_FOLIO_BATCH (1U << 13)
> > > #define IOMAP_F_SIZE_CHANGED (1U << 14)
> > > #define IOMAP_F_STALE (1U << 15)
> > >
> > > @@ -351,8 +355,8 @@ bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
> > > int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> > > const struct iomap_ops *ops,
> > > const struct iomap_write_ops *write_ops);
> > > -loff_t iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t offset,
> > > - loff_t length);
> > > +unsigned int iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t *start,
> > > + loff_t end);
> > > int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
> > > bool *did_zero, const struct iomap_ops *ops,
> > > const struct iomap_write_ops *write_ops, void *private);
> > > --
> > > 2.51.0
> > >
> > >
> >
>
>
* Re: [PATCH 3/6] xfs: flush eof folio before insert range size update
2025-11-05 15:34 ` Brian Foster
@ 2025-11-05 22:15 ` Darrick J. Wong
0 siblings, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2025-11-05 22:15 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-fsdevel, linux-xfs
On Wed, Nov 05, 2025 at 10:34:26AM -0500, Brian Foster wrote:
> On Tue, Nov 04, 2025 at 04:14:45PM -0800, Darrick J. Wong wrote:
> > On Thu, Oct 16, 2025 at 03:03:00PM -0400, Brian Foster wrote:
> > > The flush in xfs_buffered_write_iomap_begin() for zero range over a
> > > data fork hole fronted by COW fork prealloc is primarily designed to
> > > provide correct zeroing behavior in particular pagecache conditions.
> > > As it turns out, this also partially masks some odd behavior in
> > > insert range (via zero range via setattr).
> > >
> > > Insert range bumps i_size the length of the new range, flushes,
> > > unmaps pagecache and cancels COW prealloc, and then right shifts
> > > extents from the end of the file back to the target offset of the
> > > insert. Since the i_size update occurs before the pagecache flush,
> > > this creates a transient situation where writeback around EOF can
> > > behave differently.
> >
> > Why not flush the file from @offset to EOF, flush the COW
> > preallocations, extend i_size, and only then start shifting extents?
> > That would seem a lot more straightforward to me.
> >
>
> I agree. I noted in the cover letter that I started with this approach
> of reordering the existing sequence of operations, but the factoring
> looked ugly enough that I stopped and wanted to solicit input.
>
> The details of that fell out of my brain since I posted this,
> unfortunately. I suspect it may have been related to layering or
> something wrt the prepare_shift factoring, but I'll take another look in
> that direction for v2 and once I've got some feedback on the rest of the
> series.. Thanks.
I'm guessing that you want me to keep going with the other three
patches, then? :)
--D
> Brian
>
> > --D
> >
> > > This appears to be a corner case situation, but if it happens to be
> > > fronted by COW fork speculative preallocation and a large, dirty
> > > folio that contains at least one full COW block beyond EOF, the
> > > writeback after i_size is bumped may remap that COW fork block into
> > > the data fork within EOF. The block is zeroed and then shifted back
> > > out to post-eof, but this is unexpected in that it leads to a
> > > written post-eof data fork block. This can cause a zero range
> > > warning on a subsequent size extension, because we should never find
> > > blocks that require physical zeroing beyond i_size.
> > >
> > > To avoid this quirk, flush the EOF folio before the i_size update
> > > during insert range. The entire range will be flushed, unmapped and
> > > invalidated anyways, so this should be relatively unnoticeable.
> > >
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > > fs/xfs/xfs_file.c | 17 +++++++++++++++++
> > > 1 file changed, 17 insertions(+)
> > >
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 5b9864c8582e..cc3a9674ad40 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -1226,6 +1226,23 @@ xfs_falloc_insert_range(
> > > if (offset >= isize)
> > > return -EINVAL;
> > >
> > > + /*
> > > + * Let writeback clean up EOF folio state before we bump i_size. The
> > > + * insert flushes before it starts shifting and under certain
> > > + * circumstances we can write back blocks that should technically be
> > > + * considered post-eof (and thus should not be submitted for writeback).
> > > + *
> > > + * For example, a large, dirty folio that spans EOF and is backed by
> > > + * post-eof COW fork preallocation can cause block remap into the data
> > > fork. This shifts back out beyond EOF, but creates an unexpectedly
> > > + * written post-eof block. The insert is going to flush, unmap and
> > > + * cancel prealloc across this whole range, so flush EOF now before we
> > > + * bump i_size to provide consistent behavior.
> > > + */
> > > + error = filemap_write_and_wait_range(inode->i_mapping, isize, isize);
> > > + if (error)
> > > + return error;
> > > +
> > > error = xfs_falloc_setsize(file, isize + len);
> > > if (error)
> > > return error;
> > > --
> > > 2.51.0
> > >
> > >
> >
>
>
* Re: [PATCH 2/6] iomap, xfs: lift zero range hole mapping flush into xfs
2025-11-05 15:33 ` Brian Foster
@ 2025-11-05 22:23 ` Darrick J. Wong
2025-11-06 15:52 ` Brian Foster
2025-11-07 13:55 ` Christoph Hellwig
0 siblings, 2 replies; 27+ messages in thread
From: Darrick J. Wong @ 2025-11-05 22:23 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-fsdevel, linux-xfs
On Wed, Nov 05, 2025 at 10:33:16AM -0500, Brian Foster wrote:
> On Tue, Nov 04, 2025 at 04:31:14PM -0800, Darrick J. Wong wrote:
> > On Thu, Oct 16, 2025 at 03:02:59PM -0400, Brian Foster wrote:
> > > iomap zero range has a wart in that it also flushes dirty pagecache
> > > over hole mappings (rather than only unwritten mappings). This was
> > > included to accommodate a quirk in XFS where COW fork preallocation
> > > can exist over a hole in the data fork, and the associated range is
> > > reported as a hole. This is because the range actually is a hole,
> > > but XFS also has an optimization where if COW fork blocks exist for
> > > a range being written to, those blocks are used regardless of
> > > whether the data fork blocks are shared or not. For zeroing, COW
> > > fork blocks over a data fork hole are only relevant if the range is
> > > dirty in pagecache, otherwise the range is already considered
> > > zeroed.
> >
> > It occurs to me that the situation (unwritten cow mapping, hole in data
> > fork) results in iomap_iter::iomap getting the unwritten mapping, and
> > iomap_iter::srcmap getting the hole mapping. iomap_iter_srcmap returns
> > > iomap_iter::iomap because srcmap.type == HOLE.
> >
> > But then you have ext4 where there is no cow fork, so it will only ever
> > set iomap_iter::iomap, leaving iomap_iter::srcmap set to the default.
> > The default srcmap is a HOLE.
> >
> > So iomap can't distinguish between xfs' speculative cow over a hole
> > behavior vs. ext4 just being simple. I wonder if we actually need to
> > introduce a new iomap type for "pure overwrite"?
> >
>
> I definitely think we need a better solution here in iomap. The current
> iomap/srcmap management/handling is quite confusing. What that solution
> is, I'm not sure.
>
> > The reason I say that is that in designing the fuse-iomap uapi, it was a
> > lot easier to understand the programming model if there were always
> > explicit read and write mappings being sent back and forth; and a new
> > type FUSE_IOMAP_TYPE_PURE_OVERWRITE that could be stored in the write
> > mapping to mean "just look at the read mapping". If such a beast were
> > ported to the core iomap code then maybe that would help here?
> >
>
> I'm not following what this means. Separate read/write mappings for each
> individual iomap operation (i.e. "read from here, write to there"), or
> separate iomap structures to be used for read ops vs. write ops, or
> something else..?
"read from here, write to there".
First, we move IOMAP_HOLE up one:
#define IOMAP_NULL 0 /* no mapping here at all */
#define IOMAP_HOLE 1 /* no blocks allocated, need allocation */
#define IOMAP_DELALLOC 2 /* delayed allocation blocks */
...
and do some renaming:
struct iomap_iter {
struct inode *inode;
loff_t pos;
u64 len;
loff_t iter_start_pos;
int status;
unsigned flags;
struct iomap write_map;
struct iomap read_map;
void *private;
};
Then we change the interface so that ->iomap_begin always sets read_map
to a mapping from which file data can be read, and write_map is always
set to a mapping into which file data can be written. If a filesystem
doesn't support out of place writes, then it can ignore write_map and
write_map.type will be IOMAP_NULL.
(Obviously the fs always has to supply a read mapping)
The read operations (e.g. readahead, fiemap) only ever pay attention to
what the filesystem supplies in iomap_iter::read_map.
An unaligned pagecache write to an uncached region uses the read mapping
to pull data into the pagecache. For writeback, we'd use the write
mapping if it's non-null, or else the read mapping.
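E.g. (purely illustrative, using the field names above):
	/* pick the mapping that writeback should target */
	static inline const struct iomap *
	iomap_iter_writeback_map(const struct iomap_iter *iter)
	{
		if (iter->write_map.type != IOMAP_NULL)
			return &iter->write_map;
		return &iter->read_map;
	}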
This might not move the needle much wrt fixing your problem, but at
least it eliminates the weirdness around "@iomap is for reads except
when you're doing a write but you have to do a read *and* @srcmap isn't
a hole".
> > A hole with an out-of-place mapping needs a flush (or maybe just go find
> > the pagecache and zero it), whereas a hole with nothing else backing it
> > clearly doesn't need any action at all.
> >
> > Does that help?
> >
>
> This kind of sounds like what we're already doing in iomap, so I suspect
> I'm missing something on the fuse side...
>
> WRT this patchset, I'm trying to address the underlying problems that
> require the flush-a-dirty-hole hack that provides zeroing correctness
> for XFS. This needs to be lifted out of iomap because it also causes
> problems for ext4, but can be bypassed completely for XFS as well by the
> end of the series. The first part is just a straight lift into xfs, but
> the next patches replace the flush with use of the folio batch, and
> split off the band-aid case down into insert range where this flush was
> also indirectly suppressing issues.
>
> For the former, if you look at the last few patches the main reason we
> rely on the flush-a-dirty-hole hack is that we don't actually report the
> mappings correctly in XFS for zero range with respect to allowable
> behavior. We just report a hole if one exists in the data fork. So this
> is trying to encode that "COW fork prealloc over data fork hole"
> scenario correctly for zero range, and also identify when we need to
> consider whether the mapping range is dirty in cache (i.e. unwritten COW
> blocks).
Ahh, ok. I think I get what you're saying now -- instead of doddering
around in iomap to make it know the difference between "prealloc in cow,
hole in data fork" vs. "hole in data fork", you'd rather try to
accumulate folios in the folio_batch, hand that to iomap, and then iomap
will just zero folios in the batch. No need for the preflush or deep
surgery to core code.
> So yes in general I think we need to improve on iomap reporting somehow,
> but I don't necessarily see how that avoids the need (or desire) to fix
> up the iomap_begin logic. I also think it's confusing enough that it
> should probably be a separate discussion (I'd probably need to stare at
> the fuse-related proposition to grok it).
>
> Ultimately the flush in zero range should go away completely except for
> the default/fallback case where the fs supports zero range, fails to
> check pagecache itself, and iomap has otherwise detected that the range
> over an unwritten mapping was dirty. There has been some discussion over
> potentially lifting the batch lookup into iomap as well, but there are
> some details that would need to be worked out to determine whether that
> can be done safely.
<nod> ok I'll keep reading this series.
--D
> Brian
>
> > --D
> >
> > > The easiest way to deal with this corner case is to flush the
> > > pagecache to trigger COW remapping into the data fork, and then
> > > operate on the updated on-disk state. The problem is that ext4
> > > cannot accommodate a flush from this context due to being a
> > > transaction deadlock vector.
> > >
> > > Outside of the hole quirk, ext4 can avoid the flush for zero range
> > > by using the recently introduced folio batch lookup mechanism for
> > > unwritten mappings. Therefore, take the next logical step and lift
> > > the hole handling logic into the XFS iomap_begin handler. iomap will
> > > still flush on unwritten mappings without a folio batch, and XFS
> > > will flush and retry mapping lookups in the case where it would
> > > otherwise report a hole with dirty pagecache during a zero range.
> > >
> > > Note that this is intended to be a fairly straightforward lift and
> > > otherwise not change behavior. Now that the flush exists within XFS,
> > > follow on patches can further optimize it.
> > >
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > > fs/iomap/buffered-io.c | 2 +-
> > > fs/xfs/xfs_iomap.c | 25 ++++++++++++++++++++++---
> > > 2 files changed, 23 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index 05ff82c5432e..d6de689374c3 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -1543,7 +1543,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > > srcmap->type == IOMAP_UNWRITTEN)) {
> > > s64 status;
> > >
> > > - if (range_dirty) {
> > > + if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) {
> > > range_dirty = false;
> > > status = iomap_zero_iter_flush_and_stale(&iter);
> > > } else {
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index 01833aca37ac..b84c94558cc9 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -1734,6 +1734,7 @@ xfs_buffered_write_iomap_begin(
> > > if (error)
> > > return error;
> > >
> > > +restart:
> > > error = xfs_ilock_for_iomap(ip, flags, &lockmode);
> > > if (error)
> > > return error;
> > > @@ -1761,9 +1762,27 @@ xfs_buffered_write_iomap_begin(
> > > if (eof)
> > > imap.br_startoff = end_fsb; /* fake hole until the end */
> > >
> > > - /* We never need to allocate blocks for zeroing or unsharing a hole. */
> > > - if ((flags & (IOMAP_UNSHARE | IOMAP_ZERO)) &&
> > > - imap.br_startoff > offset_fsb) {
> > > + /* We never need to allocate blocks for unsharing a hole. */
> > > + if ((flags & IOMAP_UNSHARE) && imap.br_startoff > offset_fsb) {
> > > + xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
> > > + goto out_unlock;
> > > + }
> > > +
> > > + /*
> > > + * We may need to zero over a hole in the data fork if it's fronted by
> > > + * COW blocks and dirty pagecache. To make sure zeroing occurs, force
> > > + * writeback to remap pending blocks and restart the lookup.
> > > + */
> > > + if ((flags & IOMAP_ZERO) && imap.br_startoff > offset_fsb) {
> > > + if (filemap_range_needs_writeback(inode->i_mapping, offset,
> > > + offset + count - 1)) {
> > > + xfs_iunlock(ip, lockmode);
> > > + error = filemap_write_and_wait_range(inode->i_mapping,
> > > + offset, offset + count - 1);
> > > + if (error)
> > > + return error;
> > > + goto restart;
> > > + }
> > > xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
> > > goto out_unlock;
> > > }
> > > --
> > > 2.51.0
> > >
> > >
> >
>
>
* Re: [PATCH 4/6] xfs: look up cow fork extent earlier for buffered iomap_begin
2025-10-16 19:03 ` [PATCH 4/6] xfs: look up cow fork extent earlier for buffered iomap_begin Brian Foster
@ 2025-11-05 22:26 ` Darrick J. Wong
0 siblings, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2025-11-05 22:26 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-fsdevel, linux-xfs
On Thu, Oct 16, 2025 at 03:03:01PM -0400, Brian Foster wrote:
> To further isolate the need for flushing for zero range, we need to
> know whether a hole in the data fork is fronted by blocks in the COW
> fork or not. COW fork lookup currently occurs further down in the
> function, after the zero range case is handled.
>
> As a preparation step, lift the COW fork extent lookup to earlier in
> the function, at the same time as the data fork lookup. Only the
> lookup logic is lifted. The COW fork branch/reporting logic remains
> as is to avoid any observable behavior change from an iomap
> reporting perspective.
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
Looks fine to me,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
--D
> ---
> fs/xfs/xfs_iomap.c | 46 +++++++++++++++++++++++++---------------------
> 1 file changed, 25 insertions(+), 21 deletions(-)
>
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index b84c94558cc9..ba5697d8b8fd 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1753,14 +1753,29 @@ xfs_buffered_write_iomap_begin(
> goto out_unlock;
>
> /*
> - * Search the data fork first to look up our source mapping. We
> - * always need the data fork map, as we have to return it to the
> - * iomap code so that the higher level write code can read data in to
> - * perform read-modify-write cycles for unaligned writes.
> + * Search the data fork first to look up our source mapping. We always
> + * need the data fork map, as we have to return it to the iomap code so
> + * that the higher level write code can read data in to perform
> + * read-modify-write cycles for unaligned writes.
> + *
> + * Then search the COW fork extent list even if we did not find a data
> + * fork extent. This serves two purposes: first this implements the
> + * speculative preallocation using cowextsize, so that we also unshare
> + * block adjacent to shared blocks instead of just the shared blocks
> + * themselves. Second the lookup in the extent list is generally faster
> + * than going out to the shared extent tree.
> */
> eof = !xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &icur, &imap);
> if (eof)
> imap.br_startoff = end_fsb; /* fake hole until the end */
> + if (xfs_is_cow_inode(ip)) {
> + if (!ip->i_cowfp) {
> + ASSERT(!xfs_is_reflink_inode(ip));
> + xfs_ifork_init_cow(ip);
> + }
> + cow_eof = !xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb,
> + &ccur, &cmap);
> + }
>
> /* We never need to allocate blocks for unsharing a hole. */
> if ((flags & IOMAP_UNSHARE) && imap.br_startoff > offset_fsb) {
> @@ -1827,24 +1842,13 @@ xfs_buffered_write_iomap_begin(
> }
>
> /*
> - * Search the COW fork extent list even if we did not find a data fork
> - * extent. This serves two purposes: first this implements the
> - * speculative preallocation using cowextsize, so that we also unshare
> - * block adjacent to shared blocks instead of just the shared blocks
> - * themselves. Second the lookup in the extent list is generally faster
> - * than going out to the shared extent tree.
> + * Now that we've handled any operation specific special cases, at this
> + * point we can report a COW mapping if found.
> */
> - if (xfs_is_cow_inode(ip)) {
> - if (!ip->i_cowfp) {
> - ASSERT(!xfs_is_reflink_inode(ip));
> - xfs_ifork_init_cow(ip);
> - }
> - cow_eof = !xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb,
> - &ccur, &cmap);
> - if (!cow_eof && cmap.br_startoff <= offset_fsb) {
> - trace_xfs_reflink_cow_found(ip, &cmap);
> - goto found_cow;
> - }
> + if (xfs_is_cow_inode(ip) &&
> + !cow_eof && cmap.br_startoff <= offset_fsb) {
> + trace_xfs_reflink_cow_found(ip, &cmap);
> + goto found_cow;
> }
>
> if (imap.br_startoff <= offset_fsb) {
> --
> 2.51.0
>
>
* Re: [PATCH 6/6] xfs: replace zero range flush with folio batch
2025-10-16 19:03 ` [PATCH 6/6] xfs: replace zero range flush with folio batch Brian Foster
@ 2025-11-05 22:37 ` Darrick J. Wong
2025-11-06 15:53 ` Brian Foster
0 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2025-11-05 22:37 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-fsdevel, linux-xfs
On Thu, Oct 16, 2025 at 03:03:03PM -0400, Brian Foster wrote:
> Now that the zero range pagecache flush is purely isolated to
> providing zeroing correctness in this case, we can remove it and
> replace it with the folio batch mechanism that is used for handling
> unwritten extents.
>
> This is still slightly odd in that XFS reports a hole vs. a mapping
> that reflects the COW fork extents, but that has always been the
> case in this situation and so a separate issue. We drop the iomap
> warning that assumes the folio batch is always associated with
> unwritten mappings, but this is mainly a development assertion as
> otherwise the core iomap fbatch code doesn't care much about the
> mapping type if it's handed the set of folios to process.
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> fs/iomap/buffered-io.c | 4 ----
> fs/xfs/xfs_iomap.c | 16 ++++------------
> 2 files changed, 4 insertions(+), 16 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index d6de689374c3..7bc4b8d090ee 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1534,10 +1534,6 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> while ((ret = iomap_iter(&iter, ops)) > 0) {
> const struct iomap *srcmap = iomap_iter_srcmap(&iter);
>
> - if (WARN_ON_ONCE((iter.iomap.flags & IOMAP_F_FOLIO_BATCH) &&
> - srcmap->type != IOMAP_UNWRITTEN))
> - return -EIO;
> -
> if (!(iter.iomap.flags & IOMAP_F_FOLIO_BATCH) &&
> (srcmap->type == IOMAP_HOLE ||
> srcmap->type == IOMAP_UNWRITTEN)) {
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 29f1462819fa..5a845a0ded79 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1704,7 +1704,6 @@ xfs_buffered_write_iomap_begin(
> {
> struct iomap_iter *iter = container_of(iomap, struct iomap_iter,
> iomap);
> - struct address_space *mapping = inode->i_mapping;
> struct xfs_inode *ip = XFS_I(inode);
> struct xfs_mount *mp = ip->i_mount;
> xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset);
> @@ -1736,7 +1735,6 @@ xfs_buffered_write_iomap_begin(
> if (error)
> return error;
>
> -restart:
> error = xfs_ilock_for_iomap(ip, flags, &lockmode);
> if (error)
> return error;
> @@ -1812,16 +1810,10 @@ xfs_buffered_write_iomap_begin(
> xfs_trim_extent(&imap, offset_fsb,
> cmap.br_startoff + cmap.br_blockcount - offset_fsb);
> start = XFS_FSB_TO_B(mp, imap.br_startoff);
> - end = XFS_FSB_TO_B(mp,
> - imap.br_startoff + imap.br_blockcount) - 1;
> - if (filemap_range_needs_writeback(mapping, start, end)) {
> - xfs_iunlock(ip, lockmode);
> - error = filemap_write_and_wait_range(mapping, start,
> - end);
> - if (error)
> - return error;
> - goto restart;
> - }
> + end = XFS_FSB_TO_B(mp, imap.br_startoff + imap.br_blockcount);
> + iomap_flags |= iomap_fill_dirty_folios(iter, &start, end);
> + xfs_trim_extent(&imap, offset_fsb,
> + XFS_B_TO_FSB(mp, start) - offset_fsb);
Hrm, ok. This replaces the pagecache flush with passing in folios and
letting iomap zero the folios regardless of whatever's in the mapping.
That seems to me like a reasonable way to solve the immediate problem
without the huge ->iomap_begin reengineering project.
The changes here mostly look ok to me, though I wonder how well this all
meshes with all the other iomap work headed to 6.19...
--D
>
> goto found_imap;
> }
> --
> 2.51.0
>
>
* Re: [PATCH 1/6] iomap: replace folio_batch allocation with stack allocation
2025-11-05 21:41 ` Darrick J. Wong
@ 2025-11-06 15:51 ` Brian Foster
2025-11-06 15:58 ` Darrick J. Wong
0 siblings, 1 reply; 27+ messages in thread
From: Brian Foster @ 2025-11-06 15:51 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs
On Wed, Nov 05, 2025 at 01:41:00PM -0800, Darrick J. Wong wrote:
> On Wed, Nov 05, 2025 at 10:27:52AM -0500, Brian Foster wrote:
> > On Tue, Nov 04, 2025 at 04:07:16PM -0800, Darrick J. Wong wrote:
> > > On Thu, Oct 16, 2025 at 03:02:58PM -0400, Brian Foster wrote:
> > > > Zhang Yi points out that the dynamic folio_batch allocation in
> > > > iomap_fill_dirty_folios() is problematic for the ext4 on iomap work
> > > > that is under development because it doesn't sufficiently handle the
> > > > allocation failure case (by allowing a retry, for example).
> > > >
> > > > The dynamic allocation was initially added for simplicity and to
> > > > help indicate whether the batch was used or not by the calling fs.
> > > > To address this issue, put the batch on the stack of
> > > > iomap_zero_range() and use a flag to control whether the batch
> > > > should be used in the iomap folio lookup path. This keeps things
> > > > simple and eliminates the concern for ext4 on iomap.
> > > >
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > >
> > > Hrmm, so who kmallocs the fbatch array now? Is that left as an exercise
> > > to the filesystem? I'm confused because I don't see the kmalloc call
> > > reappear elsewhere, at least not in this patch.
> > >
> >
> > It's no longer dynamically allocated. It's allocated on the stack in
> > iomap_zero_range(), and a flag is set on the iomap if the pagecache
> > lookup occurs. The allocation is a potential problem for the ext4 on
> > iomap port, so this elides the need for dealing with alloc failures and
> > whatnot.
>
> Oh, silly me. I got mixed up thinking that the deleted kmalloc was
> allocating folio_batch::folios whereas in reality the kmalloc was for
> the entire struct folio_batch. So yes, it's allocated on the stack.
> Sorry about the noise.
>
Np.. This alloc was also just brought up here:
https://lore.kernel.org/linux-xfs/aQu8B63pEAzGRAkj@dread.disaster.area/
... so I'm wondering if this should be split off into a standalone patch
to try and land sooner than the rest (assuming I can get R-b's)..?
Brian
> --D
>
> > This is probably how it should have been done from the start, but when
> > the flag related feedback came along it was deep enough in test/review
> > cycle that I preferred to do it separately.
> >
> > Brian
> >
> > > --D
> > >
> > > > ---
> > > > fs/iomap/buffered-io.c | 45 ++++++++++++++++++++++++++++--------------
> > > > fs/iomap/iter.c | 6 +++---
> > > > fs/xfs/xfs_iomap.c | 11 ++++++-----
> > > > include/linux/iomap.h | 8 ++++++--
> > > > 4 files changed, 45 insertions(+), 25 deletions(-)
> > > >
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index 51ecb6d48feb..05ff82c5432e 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -761,7 +761,7 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter,
> > > > if (!mapping_large_folio_support(iter->inode->i_mapping))
> > > > len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
> > > >
> > > > - if (iter->fbatch) {
> > > > + if (iter->iomap.flags & IOMAP_F_FOLIO_BATCH) {
> > > > struct folio *folio = folio_batch_next(iter->fbatch);
> > > >
> > > > if (!folio)
> > > > @@ -858,7 +858,7 @@ static int iomap_write_begin(struct iomap_iter *iter,
> > > > * process so return and let the caller iterate and refill the batch.
> > > > */
> > > > if (!folio) {
> > > > - WARN_ON_ONCE(!iter->fbatch);
> > > > + WARN_ON_ONCE(!(iter->iomap.flags & IOMAP_F_FOLIO_BATCH));
> > > > return 0;
> > > > }
> > > >
> > > > @@ -1473,23 +1473,34 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> > > > return status;
> > > > }
> > > >
> > > > -loff_t
> > > > +/**
> > > > + * iomap_fill_dirty_folios - fill a folio batch with dirty folios
> > > > + * @iter: Iteration structure
> > > > + * @start: Start offset of range. Updated based on lookup progress.
> > > > + * @end: End offset of range
> > > > + *
> > > > + * Returns the associated control flag if the folio batch is available and the
> > > > + * lookup performed. The caller is responsible to set the flag on the associated
> > > > + * iomap.
> > > > + */
> > > > +unsigned int
> > > > iomap_fill_dirty_folios(
> > > > struct iomap_iter *iter,
> > > > - loff_t offset,
> > > > - loff_t length)
> > > > + loff_t *start,
> > > > + loff_t end)
> > > > {
> > > > struct address_space *mapping = iter->inode->i_mapping;
> > > > - pgoff_t start = offset >> PAGE_SHIFT;
> > > > - pgoff_t end = (offset + length - 1) >> PAGE_SHIFT;
> > > > + pgoff_t pstart = *start >> PAGE_SHIFT;
> > > > + pgoff_t pend = (end - 1) >> PAGE_SHIFT;
> > > >
> > > > - iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
> > > > - if (!iter->fbatch)
> > > > - return offset + length;
> > > > - folio_batch_init(iter->fbatch);
> > > > + if (!iter->fbatch) {
> > > > + *start = end;
> > > > + return 0;
> > > > + }
> > > >
> > > > - filemap_get_folios_dirty(mapping, &start, end, iter->fbatch);
> > > > - return (start << PAGE_SHIFT);
> > > > + filemap_get_folios_dirty(mapping, &pstart, pend, iter->fbatch);
> > > > + *start = (pstart << PAGE_SHIFT);
> > > > + return IOMAP_F_FOLIO_BATCH;
> > > > }
> > > > EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios);
> > > >
> > > > @@ -1498,17 +1509,21 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > > > const struct iomap_ops *ops,
> > > > const struct iomap_write_ops *write_ops, void *private)
> > > > {
> > > > + struct folio_batch fbatch;
> > > > struct iomap_iter iter = {
> > > > .inode = inode,
> > > > .pos = pos,
> > > > .len = len,
> > > > .flags = IOMAP_ZERO,
> > > > .private = private,
> > > > + .fbatch = &fbatch,
> > > > };
> > > > struct address_space *mapping = inode->i_mapping;
> > > > int ret;
> > > > bool range_dirty;
> > > >
> > > > + folio_batch_init(&fbatch);
> > > > +
> > > > /*
> > > > * To avoid an unconditional flush, check pagecache state and only flush
> > > > * if dirty and the fs returns a mapping that might convert on
> > > > @@ -1519,11 +1534,11 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > > > while ((ret = iomap_iter(&iter, ops)) > 0) {
> > > > const struct iomap *srcmap = iomap_iter_srcmap(&iter);
> > > >
> > > > - if (WARN_ON_ONCE(iter.fbatch &&
> > > > + if (WARN_ON_ONCE((iter.iomap.flags & IOMAP_F_FOLIO_BATCH) &&
> > > > srcmap->type != IOMAP_UNWRITTEN))
> > > > return -EIO;
> > > >
> > > > - if (!iter.fbatch &&
> > > > + if (!(iter.iomap.flags & IOMAP_F_FOLIO_BATCH) &&
> > > > (srcmap->type == IOMAP_HOLE ||
> > > > srcmap->type == IOMAP_UNWRITTEN)) {
> > > > s64 status;
> > > > diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
> > > > index 66ca12aac57d..026d85823c76 100644
> > > > --- a/fs/iomap/iter.c
> > > > +++ b/fs/iomap/iter.c
> > > > @@ -8,10 +8,10 @@
> > > >
> > > > static inline void iomap_iter_reset_iomap(struct iomap_iter *iter)
> > > > {
> > > > - if (iter->fbatch) {
> > > > + if (iter->iomap.flags & IOMAP_F_FOLIO_BATCH) {
> > > > folio_batch_release(iter->fbatch);
> > > > - kfree(iter->fbatch);
> > > > - iter->fbatch = NULL;
> > > > + folio_batch_reinit(iter->fbatch);
> > > > + iter->iomap.flags &= ~IOMAP_F_FOLIO_BATCH;
> > > > }
> > > >
> > > > iter->status = 0;
> > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > > index 535bf3b8705d..01833aca37ac 100644
> > > > --- a/fs/xfs/xfs_iomap.c
> > > > +++ b/fs/xfs/xfs_iomap.c
> > > > @@ -1775,7 +1775,6 @@ xfs_buffered_write_iomap_begin(
> > > > */
> > > > if (flags & IOMAP_ZERO) {
> > > > xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
> > > > - u64 end;
> > > >
> > > > if (isnullstartblock(imap.br_startblock) &&
> > > > offset_fsb >= eof_fsb)
> > > > @@ -1795,12 +1794,14 @@ xfs_buffered_write_iomap_begin(
> > > > */
> > > > if (imap.br_state == XFS_EXT_UNWRITTEN &&
> > > > offset_fsb < eof_fsb) {
> > > > - loff_t len = min(count,
> > > > - XFS_FSB_TO_B(mp, imap.br_blockcount));
> > > > + loff_t foffset = offset, fend;
> > > >
> > > > - end = iomap_fill_dirty_folios(iter, offset, len);
> > > > + fend = offset +
> > > > + min(count, XFS_FSB_TO_B(mp, imap.br_blockcount));
> > > > + iomap_flags |= iomap_fill_dirty_folios(iter, &foffset,
> > > > + fend);
> > > > end_fsb = min_t(xfs_fileoff_t, end_fsb,
> > > > - XFS_B_TO_FSB(mp, end));
> > > > + XFS_B_TO_FSB(mp, foffset));
> > > > }
> > > >
> > > > xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
> > > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > > > index cd0f573156d6..79da917ff45e 100644
> > > > --- a/include/linux/iomap.h
> > > > +++ b/include/linux/iomap.h
> > > > @@ -87,6 +87,9 @@ struct vm_fault;
> > > > /*
> > > > * Flags set by the core iomap code during operations:
> > > > *
> > > > + * IOMAP_F_FOLIO_BATCH indicates that the folio batch mechanism is active
> > > > + * for this operation, set by iomap_fill_dirty_folios().
> > > > + *
> > > > * IOMAP_F_SIZE_CHANGED indicates to the iomap_end method that the file size
> > > > * has changed as the result of this write operation.
> > > > *
> > > > @@ -94,6 +97,7 @@ struct vm_fault;
> > > > * range it covers needs to be remapped by the high level before the operation
> > > > * can proceed.
> > > > */
> > > > +#define IOMAP_F_FOLIO_BATCH (1U << 13)
> > > > #define IOMAP_F_SIZE_CHANGED (1U << 14)
> > > > #define IOMAP_F_STALE (1U << 15)
> > > >
> > > > @@ -351,8 +355,8 @@ bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
> > > > int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> > > > const struct iomap_ops *ops,
> > > > const struct iomap_write_ops *write_ops);
> > > > -loff_t iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t offset,
> > > > - loff_t length);
> > > > +unsigned int iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t *start,
> > > > + loff_t end);
> > > > int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
> > > > bool *did_zero, const struct iomap_ops *ops,
> > > > const struct iomap_write_ops *write_ops, void *private);
> > > > --
> > > > 2.51.0
> > > >
> > > >
> > >
> >
> >
>
* Re: [PATCH 2/6] iomap, xfs: lift zero range hole mapping flush into xfs
2025-11-05 22:23 ` Darrick J. Wong
@ 2025-11-06 15:52 ` Brian Foster
2025-11-06 23:32 ` Darrick J. Wong
2025-11-07 13:57 ` Christoph Hellwig
2025-11-07 13:55 ` Christoph Hellwig
1 sibling, 2 replies; 27+ messages in thread
From: Brian Foster @ 2025-11-06 15:52 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs
On Wed, Nov 05, 2025 at 02:23:50PM -0800, Darrick J. Wong wrote:
> On Wed, Nov 05, 2025 at 10:33:16AM -0500, Brian Foster wrote:
> > On Tue, Nov 04, 2025 at 04:31:14PM -0800, Darrick J. Wong wrote:
> > > On Thu, Oct 16, 2025 at 03:02:59PM -0400, Brian Foster wrote:
> > > > iomap zero range has a wart in that it also flushes dirty pagecache
> > > > over hole mappings (rather than only unwritten mappings). This was
> > > > included to accommodate a quirk in XFS where COW fork preallocation
> > > > can exist over a hole in the data fork, and the associated range is
> > > > reported as a hole. This is because the range actually is a hole,
> > > > but XFS also has an optimization where if COW fork blocks exist for
> > > > a range being written to, those blocks are used regardless of
> > > > whether the data fork blocks are shared or not. For zeroing, COW
> > > > fork blocks over a data fork hole are only relevant if the range is
> > > > dirty in pagecache, otherwise the range is already considered
> > > > zeroed.
> > >
> > > It occurs to me that the situation (unwritten cow mapping, hole in data
> > > fork) results in iomap_iter::iomap getting the unwritten mapping, and
> > > iomap_iter::srcmap getting the hole mapping. iomap_iter_srcmap returns
> > > > iomap_iter::iomap because srcmap.type == HOLE.
> > >
> > > But then you have ext4 where there is no cow fork, so it will only ever
> > > set iomap_iter::iomap, leaving iomap_iter::srcmap set to the default.
> > > The default srcmap is a HOLE.
> > >
> > > So iomap can't distinguish between xfs' speculative cow over a hole
> > > behavior vs. ext4 just being simple. I wonder if we actually need to
> > > introduce a new iomap type for "pure overwrite"?
> > >
> >
> > I definitely think we need a better solution here in iomap. The current
> > iomap/srcmap management/handling is quite confusing. What that solution
> > is, I'm not sure.
> >
> > > The reason I say that that in designing the fuse-iomap uapi, it was a
> > > lot easier to understand the programming model if there was always
> > > explicit read and write mappings being sent back and forth; and a new
> > > type FUSE_IOMAP_TYPE_PURE_OVERWRITE that could be stored in the write
> > > mapping to mean "just look at the read mapping". If such a beast were
> > > ported to the core iomap code then maybe that would help here?
> > >
> >
> > I'm not following what this means. Separate read/write mappings for each
> > individual iomap operation (i.e. "read from here, write to there"), or
> > separate iomap structures to be used for read ops vs. write ops, or
> > something else..?
>
> "read from here, write to there".
>
Ok..
> First, we move IOMAP_HOLE up one:
>
> #define IOMAP_NULL 0 /* no mapping here at all */
> #define IOMAP_HOLE 1 /* no blocks allocated, need allocation */
> #define IOMAP_DELALLOC 2 /* delayed allocation blocks */
> ...
>
> and do some renaming:
>
> struct iomap_iter {
> struct inode *inode;
> loff_t pos;
> u64 len;
> loff_t iter_start_pos;
> int status;
> unsigned flags;
> struct iomap write_map;
> struct iomap read_map;
> void *private;
> };
>
> Then we change the interface so that ->iomap_begin always sets read_map
> to a mapping from which file data can be read, and write_map is always
> set to a mapping into which file data can be written. If a filesystem
> doesn't support out of place writes, then it can ignore write_map and
> write_map.type will be IOMAP_NULL.
>
> (Obviously the fs always has to supply a read mapping)
>
> The read operations (e.g. readahead, fiemap) only ever pay attention to
> what the filesystem supplies in iomap_iter::read_map.
>
> An unaligned pagecache write to an uncached region uses the read mapping
> to pull data into the pagecache. For writeback, we'd use the write
> mapping if it's non-null, or else the read mapping.
>
Or perhaps let the read/write mappings overlap? It's not clear to me if
that's better or worse. ;P
> This might not move the needle much wrt to fixing your problem, but at
> least it eliminates the weirdness around "@iomap is for reads except
> when you're doing a write but you have to do a read *and* @srcmap isn't
> a hole".
>
Yeah.. it might be reaching a pedantic level, but to me having a couple
mappings that say "you can read from this range, write to that range,
and they might be the same" is more clear than the srcmap/dstmap/maybe
both logic we have today.
I'd be a little concerned about having similar complexity along the
lines of "read from here, write to there, except maybe sometimes write
to where you read from" in iomap if we could just make the logic dead
simple/consistent and give the fs helpers to fill things out properly.
But again this is my first exposure to the idea and so I'm thinking out
loud and probably missing some details..
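For the sake of discussion, a rough sketch of the kind of fill helpers I
have in mind, assuming the read_map/write_map layout from your proposal
above -- none of these helpers (or the fields they touch) exist today,
the names are just made up:

static inline void iomap_iter_set_read_map(struct iomap_iter *iter,
		const struct iomap *map)
{
	iter->read_map = *map;
}

/* "read from here, write to there" -- an out of place (COW) write */
static inline void iomap_iter_set_cow_maps(struct iomap_iter *iter,
		const struct iomap *read_map, const struct iomap *write_map)
{
	iter->read_map = *read_map;
	iter->write_map = *write_map;
}

/* pure overwrite: reads and writes target the same blocks */
static inline void iomap_iter_set_overwrite_map(struct iomap_iter *iter,
		const struct iomap *map)
{
	iter->read_map = *map;
	iter->write_map = *map;
}

With something like that the fs is always explicit about which mapping(s)
it is reporting, and iomap never has to guess whether a "srcmap" is
meaningful.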
> > > A hole with an out-of-place mapping needs a flush (or maybe just go find
> > > the pagecache and zero it), whereas a hole with nothing else backing it
> > > clearly doesn't need any action at all.
> > >
> > > Does that help?
> > >
> >
> > This kind of sounds like what we're already doing in iomap, so I suspect
> > I'm missing something on the fuse side...
> >
> > WRT this patchset, I'm trying to address the underlying problems that
> > require the flush-a-dirty-hole hack that provides zeroing correctness
> > for XFS. This needs to be lifted out of iomap because it also causes
> > problems for ext4, but can be bypassed completely for XFS as well by the
> > end of the series. The first part is just a straight lift into xfs, but
> > the next patches replace the flush with use of the folio batch, and
> > split off the band-aid case down into insert range where this flush was
> > also indirectly suppressing issues.
> >
> > For the former, if you look at the last few patches the main reason we
> > rely on the flush-a-dirty-hole hack is that we don't actually report the
> > mappings correctly in XFS for zero range with respect to allowable
> > behavior. We just report a hole if one exists in the data fork. So this
> > is trying to encode that "COW fork prealloc over data fork hole"
> > scenario correctly for zero range, and also identify when we need to
> > consider whether the mapping range is dirty in cache (i.e. unwritten COW
> > blocks).
>
> Ahh, ok. I think I get what you're saying now -- instead of doddering
> around in iomap to make it know the difference between "prealloc in cow,
> hole in data fork" vs. "hole in data fork", you'd rather try to
> accumulate folios in the folio_batch, hand that to iomap, and then iomap
> will just zero folios in the batch. No need for the preflush or deep
> surgery to core code.
>
Yeah, I don't want to preclude potential mapping improvements, but I'm
also not trying to solve that here just for zero range.
I suppose we could consider changing this particular case to also report
the cow mapping (similar to buffered write IIRC) and maybe that would
reduce some quirkiness, but I'd have to think harder and test that out.
It would probably still be a separate patch just because I'm hesitant to
change too much at once and hurt bisectability.
> > So yes in general I think we need to improve on iomap reporting somehow,
> > but I don't necessarily see how that avoids the need (or desire) to fix
> > up the iomap_begin logic. I also think it's confusing enough that it
> > should probably be a separate discussion (I'd probably need to stare at
> > the fuse-related proposition to grok it).
> >
> > Ultimately the flush in zero range should go away completely except for
> > the default/fallback case where the fs supports zero range, fails to
> > check pagecache itself, and iomap has otherwise detected that the range
> > over an unwritten mapping was dirty. There has been some discussion over
> > potentially lifting the batch lookup into iomap as well, but there are
> > some details that would need to be worked out to determine whether that
> > can be done safely.
>
> <nod> ok I'll keep reading this series.
>
Thanks.
Brian
> --D
>
> > Brian
> >
> > > --D
> > >
> > > > The easiest way to deal with this corner case is to flush the
> > > > pagecache to trigger COW remapping into the data fork, and then
> > > > operate on the updated on-disk state. The problem is that ext4
> > > > cannot accommodate a flush from this context due to being a
> > > > transaction deadlock vector.
> > > >
> > > > Outside of the hole quirk, ext4 can avoid the flush for zero range
> > > > by using the recently introduced folio batch lookup mechanism for
> > > > unwritten mappings. Therefore, take the next logical step and lift
> > > > the hole handling logic into the XFS iomap_begin handler. iomap will
> > > > still flush on unwritten mappings without a folio batch, and XFS
> > > > will flush and retry mapping lookups in the case where it would
> > > > otherwise report a hole with dirty pagecache during a zero range.
> > > >
> > > > Note that this is intended to be a fairly straightforward lift and
> > > > otherwise not change behavior. Now that the flush exists within XFS,
> > > > follow on patches can further optimize it.
> > > >
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > > fs/iomap/buffered-io.c | 2 +-
> > > > fs/xfs/xfs_iomap.c | 25 ++++++++++++++++++++++---
> > > > 2 files changed, 23 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index 05ff82c5432e..d6de689374c3 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -1543,7 +1543,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > > > srcmap->type == IOMAP_UNWRITTEN)) {
> > > > s64 status;
> > > >
> > > > - if (range_dirty) {
> > > > + if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) {
> > > > range_dirty = false;
> > > > status = iomap_zero_iter_flush_and_stale(&iter);
> > > > } else {
> > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > > index 01833aca37ac..b84c94558cc9 100644
> > > > --- a/fs/xfs/xfs_iomap.c
> > > > +++ b/fs/xfs/xfs_iomap.c
> > > > @@ -1734,6 +1734,7 @@ xfs_buffered_write_iomap_begin(
> > > > if (error)
> > > > return error;
> > > >
> > > > +restart:
> > > > error = xfs_ilock_for_iomap(ip, flags, &lockmode);
> > > > if (error)
> > > > return error;
> > > > @@ -1761,9 +1762,27 @@ xfs_buffered_write_iomap_begin(
> > > > if (eof)
> > > > imap.br_startoff = end_fsb; /* fake hole until the end */
> > > >
> > > > - /* We never need to allocate blocks for zeroing or unsharing a hole. */
> > > > - if ((flags & (IOMAP_UNSHARE | IOMAP_ZERO)) &&
> > > > - imap.br_startoff > offset_fsb) {
> > > > + /* We never need to allocate blocks for unsharing a hole. */
> > > > + if ((flags & IOMAP_UNSHARE) && imap.br_startoff > offset_fsb) {
> > > > + xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
> > > > + goto out_unlock;
> > > > + }
> > > > +
> > > > + /*
> > > > + * We may need to zero over a hole in the data fork if it's fronted by
> > > > + * COW blocks and dirty pagecache. To make sure zeroing occurs, force
> > > > + * writeback to remap pending blocks and restart the lookup.
> > > > + */
> > > > + if ((flags & IOMAP_ZERO) && imap.br_startoff > offset_fsb) {
> > > > + if (filemap_range_needs_writeback(inode->i_mapping, offset,
> > > > + offset + count - 1)) {
> > > > + xfs_iunlock(ip, lockmode);
> > > > + error = filemap_write_and_wait_range(inode->i_mapping,
> > > > + offset, offset + count - 1);
> > > > + if (error)
> > > > + return error;
> > > > + goto restart;
> > > > + }
> > > > xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
> > > > goto out_unlock;
> > > > }
> > > > --
> > > > 2.51.0
> > > >
> > > >
> > >
> >
> >
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 6/6] xfs: replace zero range flush with folio batch
2025-11-05 22:37 ` Darrick J. Wong
@ 2025-11-06 15:53 ` Brian Foster
0 siblings, 0 replies; 27+ messages in thread
From: Brian Foster @ 2025-11-06 15:53 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs
On Wed, Nov 05, 2025 at 02:37:15PM -0800, Darrick J. Wong wrote:
> On Thu, Oct 16, 2025 at 03:03:03PM -0400, Brian Foster wrote:
> > Now that the zero range pagecache flush is purely isolated to
> > providing zeroing correctness in this case, we can remove it and
> > replace it with the folio batch mechanism that is used for handling
> > unwritten extents.
> >
> > This is still slightly odd in that XFS reports a hole vs. a mapping
> > that reflects the COW fork extents, but that has always been the
> > case in this situation and so is a separate issue. We drop the iomap
> > warning that assumes the folio batch is always associated with
> > unwritten mappings, but this is mainly a development assertion as
> > otherwise the core iomap fbatch code doesn't care much about the
> > mapping type if it's handed the set of folios to process.
> >
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > fs/iomap/buffered-io.c | 4 ----
> > fs/xfs/xfs_iomap.c | 16 ++++------------
> > 2 files changed, 4 insertions(+), 16 deletions(-)
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index d6de689374c3..7bc4b8d090ee 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -1534,10 +1534,6 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > while ((ret = iomap_iter(&iter, ops)) > 0) {
> > const struct iomap *srcmap = iomap_iter_srcmap(&iter);
> >
> > - if (WARN_ON_ONCE((iter.iomap.flags & IOMAP_F_FOLIO_BATCH) &&
> > - srcmap->type != IOMAP_UNWRITTEN))
> > - return -EIO;
> > -
> > if (!(iter.iomap.flags & IOMAP_F_FOLIO_BATCH) &&
> > (srcmap->type == IOMAP_HOLE ||
> > srcmap->type == IOMAP_UNWRITTEN)) {
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index 29f1462819fa..5a845a0ded79 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -1704,7 +1704,6 @@ xfs_buffered_write_iomap_begin(
> > {
> > struct iomap_iter *iter = container_of(iomap, struct iomap_iter,
> > iomap);
> > - struct address_space *mapping = inode->i_mapping;
> > struct xfs_inode *ip = XFS_I(inode);
> > struct xfs_mount *mp = ip->i_mount;
> > xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset);
> > @@ -1736,7 +1735,6 @@ xfs_buffered_write_iomap_begin(
> > if (error)
> > return error;
> >
> > -restart:
> > error = xfs_ilock_for_iomap(ip, flags, &lockmode);
> > if (error)
> > return error;
> > @@ -1812,16 +1810,10 @@ xfs_buffered_write_iomap_begin(
> > xfs_trim_extent(&imap, offset_fsb,
> > cmap.br_startoff + cmap.br_blockcount - offset_fsb);
> > start = XFS_FSB_TO_B(mp, imap.br_startoff);
> > - end = XFS_FSB_TO_B(mp,
> > - imap.br_startoff + imap.br_blockcount) - 1;
> > - if (filemap_range_needs_writeback(mapping, start, end)) {
> > - xfs_iunlock(ip, lockmode);
> > - error = filemap_write_and_wait_range(mapping, start,
> > - end);
> > - if (error)
> > - return error;
> > - goto restart;
> > - }
> > + end = XFS_FSB_TO_B(mp, imap.br_startoff + imap.br_blockcount);
> > + iomap_flags |= iomap_fill_dirty_folios(iter, &start, end);
> > + xfs_trim_extent(&imap, offset_fsb,
> > + XFS_B_TO_FSB(mp, start) - offset_fsb);
>
> Hrm, ok. This replaces the pagecache flush with passing in folios and
> letting iomap zero the folios regardless of whatever's in the mapping.
> That seems to me like a reasonable way to solve the immediate problem
> without the huge reengineering ->iomap_begin project.
>
Yeah.. modulo a potential cleanup to have this actually report the COW
mapping, which might be incrementally cleaner. I'll take a closer look at
that when I get back to this..
> The changes here mostly look ok to me, though I wonder how well this all
> meshes with all the other iomap work headed to 6.19...
>
The batch stuff is already in -next I believe. It might be good to get
the first patch in sooner, but I don't think the rest should be too bad
provided it stays isolated to XFS and zero range.
Brian
> --D
>
> >
> > goto found_imap;
> > }
> > --
> > 2.51.0
> >
> >
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/6] iomap: replace folio_batch allocation with stack allocation
2025-11-06 15:51 ` Brian Foster
@ 2025-11-06 15:58 ` Darrick J. Wong
0 siblings, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2025-11-06 15:58 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-fsdevel, linux-xfs
On Thu, Nov 06, 2025 at 10:51:06AM -0500, Brian Foster wrote:
> On Wed, Nov 05, 2025 at 01:41:00PM -0800, Darrick J. Wong wrote:
> > On Wed, Nov 05, 2025 at 10:27:52AM -0500, Brian Foster wrote:
> > > On Tue, Nov 04, 2025 at 04:07:16PM -0800, Darrick J. Wong wrote:
> > > > On Thu, Oct 16, 2025 at 03:02:58PM -0400, Brian Foster wrote:
> > > > > Zhang Yi points out that the dynamic folio_batch allocation in
> > > > > iomap_fill_dirty_folios() is problematic for the ext4 on iomap work
> > > > > that is under development because it doesn't sufficiently handle the
> > > > > allocation failure case (by allowing a retry, for example).
> > > > >
> > > > > The dynamic allocation was initially added for simplicity and to
> > > > > help indicate whether the batch was used or not by the calling fs.
> > > > > To address this issue, put the batch on the stack of
> > > > > iomap_zero_range() and use a flag to control whether the batch
> > > > > should be used in the iomap folio lookup path. This keeps things
> > > > > simple and eliminates the concern for ext4 on iomap.
> > > > >
> > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > >
> > > > Hrmm, so who kmallocs the fbatch array now? Is that left as an exercise
> > > > to the filesystem? I'm confused because I don't see the kmalloc call
> > > > reappear elsewhere, at least not in this patch.
> > > >
> > >
> > > It's no longer dynamically allocated. It's allocated on the stack in
> > > iomap_zero_range(), and a flag is set on the iomap if the pagecache
> > > lookup occurs. The allocation is a potential problem for the ext4 on
> > > iomap port, so this elides the need for dealing with alloc failures and
> > > whatnot.
> >
> > Oh, silly me. I got mixed up thinking that the deleted kmalloc was
> > allocating folio_batch::folios whereas in reality the kmalloc was for
> > the entire struct folio_batch. So yes, it's allocated on the stack.
> > Sorry about the noise.
> >
>
> Np.. This alloc was also just brought up here:
>
> https://lore.kernel.org/linux-xfs/aQu8B63pEAzGRAkj@dread.disaster.area/
>
> ... so I'm wondering if this should be split off into a standalone patch
> to try and land sooner than the rest (assuming I can get R-b's)..?
Yes, that's a good idea. I'll likely rvb it, and we can armtwist the
reporter to test it for us too. ;)
--D
> Brian
>
> > --D
> >
> > > This is probably how it should have been done from the start, but when
> > > the flag related feedback came along it was deep enough in test/review
> > > cycle that I preferred to do it separately.
> > >
> > > Brian
> > >
> > > > --D
> > > >
> > > > > ---
> > > > > fs/iomap/buffered-io.c | 45 ++++++++++++++++++++++++++++--------------
> > > > > fs/iomap/iter.c | 6 +++---
> > > > > fs/xfs/xfs_iomap.c | 11 ++++++-----
> > > > > include/linux/iomap.h | 8 ++++++--
> > > > > 4 files changed, 45 insertions(+), 25 deletions(-)
> > > > >
> > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > > index 51ecb6d48feb..05ff82c5432e 100644
> > > > > --- a/fs/iomap/buffered-io.c
> > > > > +++ b/fs/iomap/buffered-io.c
> > > > > @@ -761,7 +761,7 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter,
> > > > > if (!mapping_large_folio_support(iter->inode->i_mapping))
> > > > > len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
> > > > >
> > > > > - if (iter->fbatch) {
> > > > > + if (iter->iomap.flags & IOMAP_F_FOLIO_BATCH) {
> > > > > struct folio *folio = folio_batch_next(iter->fbatch);
> > > > >
> > > > > if (!folio)
> > > > > @@ -858,7 +858,7 @@ static int iomap_write_begin(struct iomap_iter *iter,
> > > > > * process so return and let the caller iterate and refill the batch.
> > > > > */
> > > > > if (!folio) {
> > > > > - WARN_ON_ONCE(!iter->fbatch);
> > > > > + WARN_ON_ONCE(!(iter->iomap.flags & IOMAP_F_FOLIO_BATCH));
> > > > > return 0;
> > > > > }
> > > > >
> > > > > @@ -1473,23 +1473,34 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> > > > > return status;
> > > > > }
> > > > >
> > > > > -loff_t
> > > > > +/**
> > > > > + * iomap_fill_dirty_folios - fill a folio batch with dirty folios
> > > > > + * @iter: Iteration structure
> > > > > + * @start: Start offset of range. Updated based on lookup progress.
> > > > > + * @end: End offset of range
> > > > > + *
> > > > > + * Returns the associated control flag if the folio batch is available and the
> > > > > + * lookup performed. The caller is responsible to set the flag on the associated
> > > > > + * iomap.
> > > > > + */
> > > > > +unsigned int
> > > > > iomap_fill_dirty_folios(
> > > > > struct iomap_iter *iter,
> > > > > - loff_t offset,
> > > > > - loff_t length)
> > > > > + loff_t *start,
> > > > > + loff_t end)
> > > > > {
> > > > > struct address_space *mapping = iter->inode->i_mapping;
> > > > > - pgoff_t start = offset >> PAGE_SHIFT;
> > > > > - pgoff_t end = (offset + length - 1) >> PAGE_SHIFT;
> > > > > + pgoff_t pstart = *start >> PAGE_SHIFT;
> > > > > + pgoff_t pend = (end - 1) >> PAGE_SHIFT;
> > > > >
> > > > > - iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
> > > > > - if (!iter->fbatch)
> > > > > - return offset + length;
> > > > > - folio_batch_init(iter->fbatch);
> > > > > + if (!iter->fbatch) {
> > > > > + *start = end;
> > > > > + return 0;
> > > > > + }
> > > > >
> > > > > - filemap_get_folios_dirty(mapping, &start, end, iter->fbatch);
> > > > > - return (start << PAGE_SHIFT);
> > > > > + filemap_get_folios_dirty(mapping, &pstart, pend, iter->fbatch);
> > > > > + *start = (pstart << PAGE_SHIFT);
> > > > > + return IOMAP_F_FOLIO_BATCH;
> > > > > }
> > > > > EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios);
> > > > >
> > > > > @@ -1498,17 +1509,21 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > > > > const struct iomap_ops *ops,
> > > > > const struct iomap_write_ops *write_ops, void *private)
> > > > > {
> > > > > + struct folio_batch fbatch;
> > > > > struct iomap_iter iter = {
> > > > > .inode = inode,
> > > > > .pos = pos,
> > > > > .len = len,
> > > > > .flags = IOMAP_ZERO,
> > > > > .private = private,
> > > > > + .fbatch = &fbatch,
> > > > > };
> > > > > struct address_space *mapping = inode->i_mapping;
> > > > > int ret;
> > > > > bool range_dirty;
> > > > >
> > > > > + folio_batch_init(&fbatch);
> > > > > +
> > > > > /*
> > > > > * To avoid an unconditional flush, check pagecache state and only flush
> > > > > * if dirty and the fs returns a mapping that might convert on
> > > > > @@ -1519,11 +1534,11 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > > > > while ((ret = iomap_iter(&iter, ops)) > 0) {
> > > > > const struct iomap *srcmap = iomap_iter_srcmap(&iter);
> > > > >
> > > > > - if (WARN_ON_ONCE(iter.fbatch &&
> > > > > + if (WARN_ON_ONCE((iter.iomap.flags & IOMAP_F_FOLIO_BATCH) &&
> > > > > srcmap->type != IOMAP_UNWRITTEN))
> > > > > return -EIO;
> > > > >
> > > > > - if (!iter.fbatch &&
> > > > > + if (!(iter.iomap.flags & IOMAP_F_FOLIO_BATCH) &&
> > > > > (srcmap->type == IOMAP_HOLE ||
> > > > > srcmap->type == IOMAP_UNWRITTEN)) {
> > > > > s64 status;
> > > > > diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
> > > > > index 66ca12aac57d..026d85823c76 100644
> > > > > --- a/fs/iomap/iter.c
> > > > > +++ b/fs/iomap/iter.c
> > > > > @@ -8,10 +8,10 @@
> > > > >
> > > > > static inline void iomap_iter_reset_iomap(struct iomap_iter *iter)
> > > > > {
> > > > > - if (iter->fbatch) {
> > > > > + if (iter->iomap.flags & IOMAP_F_FOLIO_BATCH) {
> > > > > folio_batch_release(iter->fbatch);
> > > > > - kfree(iter->fbatch);
> > > > > - iter->fbatch = NULL;
> > > > > + folio_batch_reinit(iter->fbatch);
> > > > > + iter->iomap.flags &= ~IOMAP_F_FOLIO_BATCH;
> > > > > }
> > > > >
> > > > > iter->status = 0;
> > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > > > index 535bf3b8705d..01833aca37ac 100644
> > > > > --- a/fs/xfs/xfs_iomap.c
> > > > > +++ b/fs/xfs/xfs_iomap.c
> > > > > @@ -1775,7 +1775,6 @@ xfs_buffered_write_iomap_begin(
> > > > > */
> > > > > if (flags & IOMAP_ZERO) {
> > > > > xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
> > > > > - u64 end;
> > > > >
> > > > > if (isnullstartblock(imap.br_startblock) &&
> > > > > offset_fsb >= eof_fsb)
> > > > > @@ -1795,12 +1794,14 @@ xfs_buffered_write_iomap_begin(
> > > > > */
> > > > > if (imap.br_state == XFS_EXT_UNWRITTEN &&
> > > > > offset_fsb < eof_fsb) {
> > > > > - loff_t len = min(count,
> > > > > - XFS_FSB_TO_B(mp, imap.br_blockcount));
> > > > > + loff_t foffset = offset, fend;
> > > > >
> > > > > - end = iomap_fill_dirty_folios(iter, offset, len);
> > > > > + fend = offset +
> > > > > + min(count, XFS_FSB_TO_B(mp, imap.br_blockcount));
> > > > > + iomap_flags |= iomap_fill_dirty_folios(iter, &foffset,
> > > > > + fend);
> > > > > end_fsb = min_t(xfs_fileoff_t, end_fsb,
> > > > > - XFS_B_TO_FSB(mp, end));
> > > > > + XFS_B_TO_FSB(mp, foffset));
> > > > > }
> > > > >
> > > > > xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
> > > > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > > > > index cd0f573156d6..79da917ff45e 100644
> > > > > --- a/include/linux/iomap.h
> > > > > +++ b/include/linux/iomap.h
> > > > > @@ -87,6 +87,9 @@ struct vm_fault;
> > > > > /*
> > > > > * Flags set by the core iomap code during operations:
> > > > > *
> > > > > + * IOMAP_F_FOLIO_BATCH indicates that the folio batch mechanism is active
> > > > > + * for this operation, set by iomap_fill_dirty_folios().
> > > > > + *
> > > > > * IOMAP_F_SIZE_CHANGED indicates to the iomap_end method that the file size
> > > > > * has changed as the result of this write operation.
> > > > > *
> > > > > @@ -94,6 +97,7 @@ struct vm_fault;
> > > > > * range it covers needs to be remapped by the high level before the operation
> > > > > * can proceed.
> > > > > */
> > > > > +#define IOMAP_F_FOLIO_BATCH (1U << 13)
> > > > > #define IOMAP_F_SIZE_CHANGED (1U << 14)
> > > > > #define IOMAP_F_STALE (1U << 15)
> > > > >
> > > > > @@ -351,8 +355,8 @@ bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
> > > > > int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> > > > > const struct iomap_ops *ops,
> > > > > const struct iomap_write_ops *write_ops);
> > > > > -loff_t iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t offset,
> > > > > - loff_t length);
> > > > > +unsigned int iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t *start,
> > > > > + loff_t end);
> > > > > int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
> > > > > bool *did_zero, const struct iomap_ops *ops,
> > > > > const struct iomap_write_ops *write_ops, void *private);
> > > > > --
> > > > > 2.51.0
> > > > >
> > > > >
> > > >
> > >
> > >
> >
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 2/6] iomap, xfs: lift zero range hole mapping flush into xfs
2025-11-06 15:52 ` Brian Foster
@ 2025-11-06 23:32 ` Darrick J. Wong
2025-11-07 13:52 ` Brian Foster
2025-11-07 13:59 ` Christoph Hellwig
2025-11-07 13:57 ` Christoph Hellwig
1 sibling, 2 replies; 27+ messages in thread
From: Darrick J. Wong @ 2025-11-06 23:32 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-fsdevel, linux-xfs
On Thu, Nov 06, 2025 at 10:52:34AM -0500, Brian Foster wrote:
> On Wed, Nov 05, 2025 at 02:23:50PM -0800, Darrick J. Wong wrote:
> > On Wed, Nov 05, 2025 at 10:33:16AM -0500, Brian Foster wrote:
> > > On Tue, Nov 04, 2025 at 04:31:14PM -0800, Darrick J. Wong wrote:
> > > > On Thu, Oct 16, 2025 at 03:02:59PM -0400, Brian Foster wrote:
> > > > > iomap zero range has a wart in that it also flushes dirty pagecache
> > > > > over hole mappings (rather than only unwritten mappings). This was
> > > > > included to accommodate a quirk in XFS where COW fork preallocation
> > > > > can exist over a hole in the data fork, and the associated range is
> > > > > reported as a hole. This is because the range actually is a hole,
> > > > > but XFS also has an optimization where if COW fork blocks exist for
> > > > > a range being written to, those blocks are used regardless of
> > > > > whether the data fork blocks are shared or not. For zeroing, COW
> > > > > fork blocks over a data fork hole are only relevant if the range is
> > > > > dirty in pagecache, otherwise the range is already considered
> > > > > zeroed.
> > > >
> > > > It occurs to me that the situation (unwritten cow mapping, hole in data
> > > > fork) results in iomap_iter::iomap getting the unwritten mapping, and
> > > > iomap_iter::srcmap getting the hole mapping. iomap_iter_srcmap returns
> > > > iomap_iter::iomap because srcmap.type == HOLE.
> > > >
> > > > But then you have ext4 where there is no cow fork, so it will only ever
> > > > set iomap_iter::iomap, leaving iomap_iter::srcmap set to the default.
> > > > The default srcmap is a HOLE.
> > > >
> > > > So iomap can't distinguish between xfs' speculative cow over a hole
> > > > behavior vs. ext4 just being simple. I wonder if we actually need to
> > > > introduce a new iomap type for "pure overwrite"?
> > > >
> > >
> > > I definitely think we need a better solution here in iomap. The current
> > > iomap/srcmap management/handling is quite confusing. What that solution
> > > is, I'm not sure.
> > >
> > > > The reason I say that is that in designing the fuse-iomap uapi, it was a
> > > > lot easier to understand the programming model if there was always
> > > > explicit read and write mappings being sent back and forth; and a new
> > > > type FUSE_IOMAP_TYPE_PURE_OVERWRITE that could be stored in the write
> > > > mapping to mean "just look at the read mapping". If such a beast were
> > > > ported to the core iomap code then maybe that would help here?
> > > >
> > >
> > > I'm not following what this means. Separate read/write mappings for each
> > > individual iomap operation (i.e. "read from here, write to there"), or
> > > separate iomap structures to be used for read ops vs. write ops, or
> > > something else..?
> >
> > "read from here, write to there".
> >
>
> Ok..
>
> > First, we move IOMAP_HOLE up one:
> >
> > #define IOMAP_NULL 0 /* no mapping here at all */
> > #define IOMAP_HOLE 1 /* no blocks allocated, need allocation */
> > #define IOMAP_DELALLOC 2 /* delayed allocation blocks */
> > ...
> >
> > and do some renaming:
> >
> > struct iomap_iter {
> > struct inode *inode;
> > loff_t pos;
> > u64 len;
> > loff_t iter_start_pos;
> > int status;
> > unsigned flags;
> > struct iomap write_map;
> > struct iomap read_map;
> > void *private;
> > };
> >
> > Then we change the interface so that ->iomap_begin always sets read_map
> > to a mapping from which file data can be read, and write_map is always
> > set to a mapping into which file data can be written. If a filesystem
> > doesn't support out of place writes, then it can ignore write_map and
> > write_map.type will be IOMAP_NULL.
> >
> > (Obviously the fs always has to supply a read mapping)
> >
> > The read operations (e.g. readahead, fiemap) only ever pay attention to
> > what the filesystem supplies in iomap_iter::read_map.
> >
> > An unaligned pagecache write to an uncached region uses the read mapping
> > to pull data into the pagecache. For writeback, we'd use the write
> > mapping if it's non-null, or else the read mapping.
> >
>
> Or perhaps let the read/write mappings overlap? It's not clear to me if
> that's better or worse. ;P
>
> > This might not move the needle much wrt to fixing your problem, but at
> > least it eliminates the weirdness around "@iomap is for reads except
> > when you're doing a write but you have to do a read *and* @srcmap isn't
> > a hole".
> >
>
> Yeah.. it might be reaching a pedantic level, but to me having a couple
> mappings that say "you can read from this range, write to that range,
> and they might be the same" is more clear than the srcmap/dstmap/maybe
> both logic we have today.
>
> I'd be a little concerned about having similar complexity along the
> lines of "read from here, write to there, except maybe sometimes write
> to where you read from" in iomap if we could just make the logic dead
> simple/consistent and give the fs helpers to fill things out properly.
> But again this is my first exposure to the idea and so I'm thinking out
> loud and probably missing some details..
That's basically it. I thought about requiring the fuse server to fill
out both read and write iomaps, but decided the OVERWRITE_ONLY way was
easier because libfuse could initialize write_iomap to OVERWRITE_ONLY
which would be a "sane" default.
A harder question is: what would filesystems put in write_map for a pure
read? They could leave it at the default (null mapping) and the kernel
wouldn't pay attention to it.
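To illustrate (hypothetically -- the signature and fields here assume the
read_map/write_map split sketched above and don't exist anywhere), a
filesystem with no out of place writes would only ever fill the read
mapping:

static int simplefs_iomap_begin(struct inode *inode, loff_t pos,
		loff_t length, unsigned flags, struct iomap *read_map,
		struct iomap *write_map)
{
	/* toy mapping: pretend the file is one contiguous mapped extent */
	read_map->type = IOMAP_MAPPED;
	read_map->addr = pos;
	read_map->offset = pos;
	read_map->length = length;
	read_map->bdev = inode->i_sb->s_bdev;

	/* reads and plain overwrites alike: write_map stays IOMAP_NULL */
	return 0;
}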
> > > > A hole with an out-of-place mapping needs a flush (or maybe just go find
> > > > the pagecache and zero it), whereas a hole with nothing else backing it
> > > > clearly doesn't need any action at all.
> > > >
> > > > Does that help?
> > > >
> > >
> > > This kind of sounds like what we're already doing in iomap, so I suspect
> > > I'm missing something on the fuse side...
> > >
> > > WRT this patchset, I'm trying to address the underlying problems that
> > > require the flush-a-dirty-hole hack that provides zeroing correctness
> > > for XFS. This needs to be lifted out of iomap because it also causes
> > > problems for ext4, but can be bypassed completely for XFS as well by the
> > > end of the series. The first part is just a straight lift into xfs, but
> > > the next patches replace the flush with use of the folio batch, and
> > > split off the band-aid case down into insert range where this flush was
> > > also indirectly suppressing issues.
> > >
> > > For the former, if you look at the last few patches the main reason we
> > > rely on the flush-a-dirty-hole hack is that we don't actually report the
> > > mappings correctly in XFS for zero range with respect to allowable
> > > behavior. We just report a hole if one exists in the data fork. So this
> > > is trying to encode that "COW fork prealloc over data fork hole"
> > > scenario correctly for zero range, and also identify when we need to
> > > consider whether the mapping range is dirty in cache (i.e. unwritten COW
> > > blocks).
> >
> > Ahh, ok. I think I get what you're saying now -- instead of doddering
> > around in iomap to make it know the difference between "prealloc in cow,
> > hole in data fork" vs. "hole in data fork", you'd rather try to
> > accumulate folios in the folio_batch, hand that to iomap, and then iomap
> > will just zero folios in the batch. No need for the preflush or deep
> > surgery to core code.
> >
>
> Yeah, I don't want to preclude potential mapping improvements, but I'm
> also not trying to solve that here just for zero range.
<nod> The problem fixed by each of our patchsets is not a sidequest for
the other's. ;)
I think your patchset is fine, but remind me -- what happens if you
can't fill the whole folio batch? iomap just calls us back to get the
rest of the mapping and more folios? I think that's sufficient to fix
the problem.
> I suppose we could consider changing this particular case to also report
> the cow mapping (similar to buffered write IIRC) and maybe that would
> reduce some quirkiness, but I'd have to think harder and test that out.
> > It would probably still be a separate patch just because I'm hesitant to
> change too much at once and hurt bisectability.
It never hurts to tack the experimental patches on the end of a series, if
it would help answer hypothetical questions. Though those ought to be
marked as such.
--D
> > > So yes in general I think we need to improve on iomap reporting somehow,
> > > but I don't necessarily see how that avoids the need (or desire) to fix
> > > up the iomap_begin logic. I also think it's confusing enough that it
> > > should probably be a separate discussion (I'd probably need to stare at
> > > the fuse-related proposition to grok it).
> > >
> > > Ultimately the flush in zero range should go away completely except for
> > > the default/fallback case where the fs supports zero range, fails to
> > > check pagecache itself, and iomap has otherwise detected that the range
> > > over an unwritten mapping was dirty. There has been some discussion over
> > > potentially lifting the batch lookup into iomap as well, but there are
> > > some details that would need to be worked out to determine whether that
> > > can be done safely.
> >
> > <nod> ok I'll keep reading this series.
> >
>
> Thanks.
>
> Brian
>
> > --D
> >
> > > Brian
> > >
> > > > --D
> > > >
> > > > > The easiest way to deal with this corner case is to flush the
> > > > > pagecache to trigger COW remapping into the data fork, and then
> > > > > operate on the updated on-disk state. The problem is that ext4
> > > > > cannot accommodate a flush from this context due to being a
> > > > > transaction deadlock vector.
> > > > >
> > > > > Outside of the hole quirk, ext4 can avoid the flush for zero range
> > > > > by using the recently introduced folio batch lookup mechanism for
> > > > > unwritten mappings. Therefore, take the next logical step and lift
> > > > > the hole handling logic into the XFS iomap_begin handler. iomap will
> > > > > still flush on unwritten mappings without a folio batch, and XFS
> > > > > will flush and retry mapping lookups in the case where it would
> > > > > otherwise report a hole with dirty pagecache during a zero range.
> > > > >
> > > > > Note that this is intended to be a fairly straightforward lift and
> > > > > otherwise not change behavior. Now that the flush exists within XFS,
> > > > > follow on patches can further optimize it.
> > > > >
> > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > ---
> > > > > fs/iomap/buffered-io.c | 2 +-
> > > > > fs/xfs/xfs_iomap.c | 25 ++++++++++++++++++++++---
> > > > > 2 files changed, 23 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > > index 05ff82c5432e..d6de689374c3 100644
> > > > > --- a/fs/iomap/buffered-io.c
> > > > > +++ b/fs/iomap/buffered-io.c
> > > > > @@ -1543,7 +1543,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > > > > srcmap->type == IOMAP_UNWRITTEN)) {
> > > > > s64 status;
> > > > >
> > > > > - if (range_dirty) {
> > > > > + if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) {
> > > > > range_dirty = false;
> > > > > status = iomap_zero_iter_flush_and_stale(&iter);
> > > > > } else {
> > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > > > index 01833aca37ac..b84c94558cc9 100644
> > > > > --- a/fs/xfs/xfs_iomap.c
> > > > > +++ b/fs/xfs/xfs_iomap.c
> > > > > @@ -1734,6 +1734,7 @@ xfs_buffered_write_iomap_begin(
> > > > > if (error)
> > > > > return error;
> > > > >
> > > > > +restart:
> > > > > error = xfs_ilock_for_iomap(ip, flags, &lockmode);
> > > > > if (error)
> > > > > return error;
> > > > > @@ -1761,9 +1762,27 @@ xfs_buffered_write_iomap_begin(
> > > > > if (eof)
> > > > > imap.br_startoff = end_fsb; /* fake hole until the end */
> > > > >
> > > > > - /* We never need to allocate blocks for zeroing or unsharing a hole. */
> > > > > - if ((flags & (IOMAP_UNSHARE | IOMAP_ZERO)) &&
> > > > > - imap.br_startoff > offset_fsb) {
> > > > > + /* We never need to allocate blocks for unsharing a hole. */
> > > > > + if ((flags & IOMAP_UNSHARE) && imap.br_startoff > offset_fsb) {
> > > > > + xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
> > > > > + goto out_unlock;
> > > > > + }
> > > > > +
> > > > > + /*
> > > > > + * We may need to zero over a hole in the data fork if it's fronted by
> > > > > + * COW blocks and dirty pagecache. To make sure zeroing occurs, force
> > > > > + * writeback to remap pending blocks and restart the lookup.
> > > > > + */
> > > > > + if ((flags & IOMAP_ZERO) && imap.br_startoff > offset_fsb) {
> > > > > + if (filemap_range_needs_writeback(inode->i_mapping, offset,
> > > > > + offset + count - 1)) {
> > > > > + xfs_iunlock(ip, lockmode);
> > > > > + error = filemap_write_and_wait_range(inode->i_mapping,
> > > > > + offset, offset + count - 1);
> > > > > + if (error)
> > > > > + return error;
> > > > > + goto restart;
> > > > > + }
> > > > > xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
> > > > > goto out_unlock;
> > > > > }
> > > > > --
> > > > > 2.51.0
> > > > >
> > > > >
> > > >
> > >
> > >
> >
>
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 2/6] iomap, xfs: lift zero range hole mapping flush into xfs
2025-11-06 23:32 ` Darrick J. Wong
@ 2025-11-07 13:52 ` Brian Foster
2025-11-07 13:59 ` Christoph Hellwig
1 sibling, 0 replies; 27+ messages in thread
From: Brian Foster @ 2025-11-07 13:52 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs
On Thu, Nov 06, 2025 at 03:32:08PM -0800, Darrick J. Wong wrote:
> On Thu, Nov 06, 2025 at 10:52:34AM -0500, Brian Foster wrote:
> > On Wed, Nov 05, 2025 at 02:23:50PM -0800, Darrick J. Wong wrote:
> > > On Wed, Nov 05, 2025 at 10:33:16AM -0500, Brian Foster wrote:
> > > > On Tue, Nov 04, 2025 at 04:31:14PM -0800, Darrick J. Wong wrote:
> > > > > On Thu, Oct 16, 2025 at 03:02:59PM -0400, Brian Foster wrote:
> > > > > > iomap zero range has a wart in that it also flushes dirty pagecache
> > > > > > over hole mappings (rather than only unwritten mappings). This was
> > > > > > included to accommodate a quirk in XFS where COW fork preallocation
> > > > > > can exist over a hole in the data fork, and the associated range is
> > > > > > reported as a hole. This is because the range actually is a hole,
> > > > > > but XFS also has an optimization where if COW fork blocks exist for
> > > > > > a range being written to, those blocks are used regardless of
> > > > > > whether the data fork blocks are shared or not. For zeroing, COW
> > > > > > fork blocks over a data fork hole are only relevant if the range is
> > > > > > dirty in pagecache, otherwise the range is already considered
> > > > > > zeroed.
> > > > >
> > > > > It occurs to me that the situation (unwritten cow mapping, hole in data
> > > > > fork) results in iomap_iter::iomap getting the unwritten mapping, and
> > > > > iomap_iter::srcmap getting the hole mapping. iomap_iter_srcmap returns
> > > > > iomap_iter::iomap because srcmap.type == HOLE.
> > > > >
> > > > > But then you have ext4 where there is no cow fork, so it will only ever
> > > > > set iomap_iter::iomap, leaving iomap_iter::srcmap set to the default.
> > > > > The default srcmap is a HOLE.
> > > > >
> > > > > So iomap can't distinguish between xfs' speculative cow over a hole
> > > > > behavior vs. ext4 just being simple. I wonder if we actually need to
> > > > > introduce a new iomap type for "pure overwrite"?
> > > > >
> > > >
> > > > I definitely think we need a better solution here in iomap. The current
> > > > iomap/srcmap management/handling is quite confusing. What that solution
> > > > is, I'm not sure.
> > > >
> > > > > The reason I say that is that in designing the fuse-iomap uapi, it was a
> > > > > lot easier to understand the programming model if there was always
> > > > > explicit read and write mappings being sent back and forth; and a new
> > > > > type FUSE_IOMAP_TYPE_PURE_OVERWRITE that could be stored in the write
> > > > > mapping to mean "just look at the read mapping". If such a beast were
> > > > > ported to the core iomap code then maybe that would help here?
> > > > >
> > > >
> > > > I'm not following what this means. Separate read/write mappings for each
> > > > individual iomap operation (i.e. "read from here, write to there"), or
> > > > separate iomap structures to be used for read ops vs. write ops, or
> > > > something else..?
> > >
> > > "read from here, write to there".
> > >
> >
> > Ok..
> >
> > > First, we move IOMAP_HOLE up one:
> > >
> > > #define IOMAP_NULL 0 /* no mapping here at all */
> > > #define IOMAP_HOLE 1 /* no blocks allocated, need allocation */
> > > #define IOMAP_DELALLOC 2 /* delayed allocation blocks */
> > > ...
> > >
> > > and do some renaming:
> > >
> > > struct iomap_iter {
> > > struct inode *inode;
> > > loff_t pos;
> > > u64 len;
> > > loff_t iter_start_pos;
> > > int status;
> > > unsigned flags;
> > > struct iomap write_map;
> > > struct iomap read_map;
> > > void *private;
> > > };
> > >
> > > Then we change the interface so that ->iomap_begin always sets read_map
> > > to a mapping from which file data can be read, and write_map is always
> > > set to a mapping into which file data can be written. If a filesystem
> > > doesn't support out of place writes, then it can ignore write_map and
> > > write_map.type will be IOMAP_NULL.
> > >
> > > (Obviously the fs always has to supply a read mapping)
> > >
> > > The read operations (e.g. readahead, fiemap) only ever pay attention to
> > > what the filesystem supplies in iomap_iter::read_map.
> > >
> > > An unaligned pagecache write to an uncached region uses the read mapping
> > > to pull data into the pagecache. For writeback, we'd use the write
> > > mapping if it's non-null, or else the read mapping.
> > >
> >
> > Or perhaps let the read/write mappings overlap? It's not clear to me if
> > that's better or worse. ;P
> >
> > > This might not move the needle much wrt to fixing your problem, but at
> > > least it eliminates the weirdness around "@iomap is for reads except
> > > when you're doing a write but you have to do a read *and* @srcmap isn't
> > > a hole".
> > >
> >
> > Yeah.. it might be reaching a pedantic level, but to me having a couple
> > mappings that say "you can read from this range, write to that range,
> > and they might be the same" is more clear than the srcmap/dstmap/maybe
> > both logic we have today.
> >
> > I'd be a little concerned about having similar complexity along the
> > lines of "read from here, write to there, except maybe sometimes write
> > to where you read from" in iomap if we could just make the logic dead
> > simple/consistent and give the fs helpers to fill things out properly.
> > But again this is my first exposure to the idea and so I'm thinking out
> > loud and probably missing some details..
>
> That's basically it. I thought about requiring the fuse server to fill
> out both read and write iomaps, but decided the OVERWRITE_ONLY way was
> easier because libfuse could initialize write_iomap to OVERWRITE_ONLY
> which would be a "sane" default.
>
Ok.
> A harder question is: what would filesystems put in write_map for a pure
> read? They could leave it at the default (null mapping) and the kernel
> wouldn't pay attention to it.
>
I'd figure the mappings would need to be set up properly for the
operation, and iomap would do some up front verification or otherwise
bail out if it sees something unexpected. So for a read I'd expect a
NULL write mapping or whatever, otherwise an EINVAL or warning or
something.
I don't know how painful that would be on the filesystem, but that's
kind of why I was wondering about helpers to fill out the mappings
properly (i.e. "fill for read, write, etc."). But maybe an iomap flag
analogous to the fuse overwrite example would accomplish the same thing
in terms of just making sure the fs is intentional in what is reported.
Another question I wonder about though is whether it's logically safer
to report an "overwriteable" read mapping or a "readable" write mapping
(interface bikeshedding in 3, 2.. ;)..
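Something along these lines is roughly the verification I'm picturing --
again purely hypothetical, assuming the read_map/write_map fields and an
IOMAP_NULL type from the proposal upthread:

static int iomap_iter_verify_maps(const struct iomap_iter *iter)
{
	/* a read mapping is always required */
	if (WARN_ON_ONCE(iter->read_map.type == IOMAP_NULL))
		return -EINVAL;

	/* pure read operations shouldn't be handed a write mapping at all */
	if (!(iter->flags & IOMAP_WRITE) &&
	    WARN_ON_ONCE(iter->write_map.type != IOMAP_NULL))
		return -EINVAL;

	return 0;
}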
> > > > > A hole with an out-of-place mapping needs a flush (or maybe just go find
> > > > > the pagecache and zero it), whereas a hole with nothing else backing it
> > > > > clearly doesn't need any action at all.
> > > > >
> > > > > Does that help?
> > > > >
> > > >
> > > > This kind of sounds like what we're already doing in iomap, so I suspect
> > > > I'm missing something on the fuse side...
> > > >
> > > > WRT this patchset, I'm trying to address the underlying problems that
> > > > require the flush-a-dirty-hole hack that provides zeroing correctness
> > > > for XFS. This needs to be lifted out of iomap because it also causes
> > > > problems for ext4, but can be bypassed completely for XFS as well by the
> > > > end of the series. The first part is just a straight lift into xfs, but
> > > > the next patches replace the flush with use of the folio batch, and
> > > > split off the band-aid case down into insert range where this flush was
> > > > also indirectly suppressing issues.
> > > >
> > > > For the former, if you look at the last few patches the main reason we
> > > > rely on the flush-a-dirty-hole hack is that we don't actually report the
> > > > mappings correctly in XFS for zero range with respect to allowable
> > > > behavior. We just report a hole if one exists in the data fork. So this
> > > > is trying to encode that "COW fork prealloc over data fork hole"
> > > > scenario correctly for zero range, and also identify when we need to
> > > > consider whether the mapping range is dirty in cache (i.e. unwritten COW
> > > > blocks).
> > >
> > > Ahh, ok. I think I get what you're saying now -- instead of doddering
> > > around in iomap to make it know the difference between "prealloc in cow,
> > > hole in data fork" vs. "hole in data fork", you'd rather try to
> > > accumulate folios in the folio_batch, hand that to iomap, and then iomap
> > > will just zero folios in the batch. No need for the preflush or deep
> > > surgery to core code.
> > >
> >
> > Yeah, I don't want to preclude potential mapping improvements, but I'm
> > also not trying to solve that here just for zero range.
>
> <nod> The problem fixed by each of our patchsets is not a sidequest for
> the other's. ;)
>
> I think your patchset is fine, but remind me -- what happens if you
> can't fill the whole folio batch? iomap just calls us back to get the
> rest of the mapping and more folios? I think that's sufficient to fix
> the problem.
>
The flag indicates whether the batch is "valid" (i.e. the lookup occurred),
so an empty (or sparse, with respect to the mapping) batch with an
unwritten mapping means the pagecache is clean and zeroing can skip it.
For batch-full cases, the fs trims the mapping to the last folio in the
batch to ensure iomap cycles back through for the rest of the mapping
and folios. This isn't an issue in practice currently, however, because
zero range is pretty much relegated to single EOF folio handling cases
or to cases where the filesystem already issues writeback in advance (i.e.
fallocate). This is part of the reason I introduced the force zeroing
error tag in XFS: it forces the use of iomap zero range to help provide
more unit test coverage of the iomap mechanism.
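To put the fs side of that pattern in code form (modeled on the xfs hunks
in this series -- the "examplefs" helper and its arguments are simplified
placeholders, error handling and fs-specific trimming omitted):

/*
 * Look up dirty folios for a zero range mapping and report how much of
 * the mapping the batch actually covers.
 */
static unsigned int examplefs_zero_range_fill_batch(struct iomap_iter *iter,
		loff_t pos, loff_t *maplen)
{
	loff_t start = pos;		/* start of the range to zero */
	loff_t end = pos + *maplen;	/* end of the mapping being reported */
	unsigned int iomap_flags;

	/*
	 * Gather dirty folios over the range. IOMAP_F_FOLIO_BATCH is
	 * returned if the lookup ran; @start is pushed past the last folio
	 * gathered, which is the whole range unless the batch filled up
	 * first.
	 */
	iomap_flags = iomap_fill_dirty_folios(iter, &start, end);

	/*
	 * Trim the reported mapping to what the lookup covered. If the
	 * batch filled up early, iomap cycles back into ->iomap_begin for
	 * the remainder with a refilled batch; an empty batch over an
	 * unwritten mapping means the pagecache is clean and zeroing skips
	 * it. The caller is expected to set the returned flag on the iomap
	 * it hands back.
	 */
	*maplen = min(start, end) - pos;
	return iomap_flags;
}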
> > I suppose we could consider changing this particular case to also report
> > the cow mapping (similar to buffered write IIRC) and maybe that would
> > reduce some quirkiness, but I'd have to think harder and test that out.
> > It would probably still be a separate patch just because I'm hesitant to
> > change too much at once and hurt bisectability.
>
> It never hurts to tack the experimental patches on the end of series, if
> it would help answer hypothetical questions. Though those ought to be
> marked as such.
>
Indeed. The question I have in mind is whether it might be worth
changing the reporting sooner and reordering that with lifting and
removing the flush. We end up in the same place either way so it's more
of a bisectability thing, but I want to take a closer look before
deciding on it. If there's limited value in reordering but it otherwise
functions correctly, I'd probably tack the patch on the end as a general
improvement..
Brian
> --D
>
> > > > So yes in general I think we need to improve on iomap reporting somehow,
> > > > but I don't necessarily see how that avoids the need (or desire) to fix
> > > > up the iomap_begin logic. I also think it's confusing enough that it
> > > > should probably be a separate discussion (I'd probably need to stare at
> > > > the fuse-related proposition to grok it).
> > > >
> > > > Ultimately the flush in zero range should go away completely except for
> > > > the default/fallback case where the fs supports zero range, fails to
> > > > check pagecache itself, and iomap has otherwise detected that the range
> > > > over an unwritten mapping was dirty. There has been some discussion over
> > > > potentially lifting the batch lookup into iomap as well, but there are
> > > > some details that would need to be worked out to determine whether that
> > > > can be done safely.
> > >
> > > <nod> ok I'll keep reading this series.
> > >
> >
> > Thanks.
> >
> > Brian
> >
> > > --D
> > >
> > > > Brian
> > > >
> > > > > --D
> > > > >
> > > > > > The easiest way to deal with this corner case is to flush the
> > > > > > pagecache to trigger COW remapping into the data fork, and then
> > > > > > operate on the updated on-disk state. The problem is that ext4
> > > > > > cannot accommodate a flush from this context due to being a
> > > > > > transaction deadlock vector.
> > > > > >
> > > > > > Outside of the hole quirk, ext4 can avoid the flush for zero range
> > > > > > by using the recently introduced folio batch lookup mechanism for
> > > > > > unwritten mappings. Therefore, take the next logical step and lift
> > > > > > the hole handling logic into the XFS iomap_begin handler. iomap will
> > > > > > still flush on unwritten mappings without a folio batch, and XFS
> > > > > > will flush and retry mapping lookups in the case where it would
> > > > > > otherwise report a hole with dirty pagecache during a zero range.
> > > > > >
> > > > > > Note that this is intended to be a fairly straightforward lift and
> > > > > > otherwise not change behavior. Now that the flush exists within XFS,
> > > > > > follow on patches can further optimize it.
> > > > > >
> > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > > ---
> > > > > > fs/iomap/buffered-io.c | 2 +-
> > > > > > fs/xfs/xfs_iomap.c | 25 ++++++++++++++++++++++---
> > > > > > 2 files changed, 23 insertions(+), 4 deletions(-)
> > > > > >
> > > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > > > index 05ff82c5432e..d6de689374c3 100644
> > > > > > --- a/fs/iomap/buffered-io.c
> > > > > > +++ b/fs/iomap/buffered-io.c
> > > > > > @@ -1543,7 +1543,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > > > > > srcmap->type == IOMAP_UNWRITTEN)) {
> > > > > > s64 status;
> > > > > >
> > > > > > - if (range_dirty) {
> > > > > > + if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) {
> > > > > > range_dirty = false;
> > > > > > status = iomap_zero_iter_flush_and_stale(&iter);
> > > > > > } else {
> > > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > > > > index 01833aca37ac..b84c94558cc9 100644
> > > > > > --- a/fs/xfs/xfs_iomap.c
> > > > > > +++ b/fs/xfs/xfs_iomap.c
> > > > > > @@ -1734,6 +1734,7 @@ xfs_buffered_write_iomap_begin(
> > > > > > if (error)
> > > > > > return error;
> > > > > >
> > > > > > +restart:
> > > > > > error = xfs_ilock_for_iomap(ip, flags, &lockmode);
> > > > > > if (error)
> > > > > > return error;
> > > > > > @@ -1761,9 +1762,27 @@ xfs_buffered_write_iomap_begin(
> > > > > > if (eof)
> > > > > > imap.br_startoff = end_fsb; /* fake hole until the end */
> > > > > >
> > > > > > - /* We never need to allocate blocks for zeroing or unsharing a hole. */
> > > > > > - if ((flags & (IOMAP_UNSHARE | IOMAP_ZERO)) &&
> > > > > > - imap.br_startoff > offset_fsb) {
> > > > > > + /* We never need to allocate blocks for unsharing a hole. */
> > > > > > + if ((flags & IOMAP_UNSHARE) && imap.br_startoff > offset_fsb) {
> > > > > > + xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
> > > > > > + goto out_unlock;
> > > > > > + }
> > > > > > +
> > > > > > + /*
> > > > > > + * We may need to zero over a hole in the data fork if it's fronted by
> > > > > > + * COW blocks and dirty pagecache. To make sure zeroing occurs, force
> > > > > > + * writeback to remap pending blocks and restart the lookup.
> > > > > > + */
> > > > > > + if ((flags & IOMAP_ZERO) && imap.br_startoff > offset_fsb) {
> > > > > > + if (filemap_range_needs_writeback(inode->i_mapping, offset,
> > > > > > + offset + count - 1)) {
> > > > > > + xfs_iunlock(ip, lockmode);
> > > > > > + error = filemap_write_and_wait_range(inode->i_mapping,
> > > > > > + offset, offset + count - 1);
> > > > > > + if (error)
> > > > > > + return error;
> > > > > > + goto restart;
> > > > > > + }
> > > > > > xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
> > > > > > goto out_unlock;
> > > > > > }
> > > > > > --
> > > > > > 2.51.0
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > >
> >
> >
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 2/6] iomap, xfs: lift zero range hole mapping flush into xfs
2025-11-05 22:23 ` Darrick J. Wong
2025-11-06 15:52 ` Brian Foster
@ 2025-11-07 13:55 ` Christoph Hellwig
1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-11-07 13:55 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Brian Foster, linux-fsdevel, linux-xfs
On Wed, Nov 05, 2025 at 02:23:50PM -0800, Darrick J. Wong wrote:
> Then we change the interface so that ->iomap_begin always sets read_map
> to a mapping from which file data can be read, and write_map is always
> set to a mapping into which file data can be written. If a filesystem
> doesn't support out of place writes, then it can ignore write_map and
> write_map.type will be IOMAP_NULL.
Please don't. The whole two maps in one operation thing was a giant
mistake; let's not double down on that. The only places that actually
need it are the buffered write read-modify-write code, and the unshare
and zeroing code that reuse that in ugly ways. We need to untangle
that properly:
- for the buffered write read-modify-write just do a separate iomap
operation for the partial read
- for zeroing do a separate query for holes and for the overwrite
space allocation if the file system needs it due to COW
- same for unshare
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 2/6] iomap, xfs: lift zero range hole mapping flush into xfs
2025-11-06 15:52 ` Brian Foster
2025-11-06 23:32 ` Darrick J. Wong
@ 2025-11-07 13:57 ` Christoph Hellwig
1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-11-07 13:57 UTC (permalink / raw)
To: Brian Foster; +Cc: Darrick J. Wong, linux-fsdevel, linux-xfs
On Thu, Nov 06, 2025 at 10:52:34AM -0500, Brian Foster wrote:
> > An unaligned pagecache write to an uncached region uses the read mapping
> > to pull data into the pagecache. For writeback, we'd use the write
> > mapping if it's non-null, or else the read mapping.
> >
>
> Or perhaps let the read/write mappings overlap? It's not clear to me if
> that's better or worse. ;P
It is much better for the case where they actually are different.
> > This might not move the needle much wrt to fixing your problem, but at
> > least it eliminates the weirdness around "@iomap is for reads except
> > when you're doing a write but you have to do a read *and* @srcmap isn't
> > a hole".
> >
>
> Yeah.. it might be reaching a pedantic level, but to me having a couple
> mappings that say "you can read from this range, write to that range,
> and they might be the same" is more clear than the srcmap/dstmap/maybe
> both logic we have today.
It is a lot better.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 2/6] iomap, xfs: lift zero range hole mapping flush into xfs
2025-11-06 23:32 ` Darrick J. Wong
2025-11-07 13:52 ` Brian Foster
@ 2025-11-07 13:59 ` Christoph Hellwig
1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-11-07 13:59 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Brian Foster, linux-fsdevel, linux-xfs
On Thu, Nov 06, 2025 at 03:32:08PM -0800, Darrick J. Wong wrote:
> A harder question is: what would filesystems put in write_map for a pure
> read? They could leave it at the default (null mapping) and the kernel
> wouldn't pay attention to it.
Please make them entirely separate iomap_iters, and allow nesting.
That way no write map even exists for pure reads.
^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread
Thread overview: 27+ messages
2025-10-16 19:02 [PATCH 0/6] iomap, xfs: improve zero range flushing and lookup Brian Foster
2025-10-16 19:02 ` [PATCH 1/6] iomap: replace folio_batch allocation with stack allocation Brian Foster
2025-11-05 0:07 ` Darrick J. Wong
2025-11-05 15:27 ` Brian Foster
2025-11-05 21:41 ` Darrick J. Wong
2025-11-06 15:51 ` Brian Foster
2025-11-06 15:58 ` Darrick J. Wong
2025-10-16 19:02 ` [PATCH 2/6] iomap, xfs: lift zero range hole mapping flush into xfs Brian Foster
2025-11-05 0:31 ` Darrick J. Wong
2025-11-05 15:33 ` Brian Foster
2025-11-05 22:23 ` Darrick J. Wong
2025-11-06 15:52 ` Brian Foster
2025-11-06 23:32 ` Darrick J. Wong
2025-11-07 13:52 ` Brian Foster
2025-11-07 13:59 ` Christoph Hellwig
2025-11-07 13:57 ` Christoph Hellwig
2025-11-07 13:55 ` Christoph Hellwig
2025-10-16 19:03 ` [PATCH 3/6] xfs: flush eof folio before insert range size update Brian Foster
2025-11-05 0:14 ` Darrick J. Wong
2025-11-05 15:34 ` Brian Foster
2025-11-05 22:15 ` Darrick J. Wong
2025-10-16 19:03 ` [PATCH 4/6] xfs: look up cow fork extent earlier for buffered iomap_begin Brian Foster
2025-11-05 22:26 ` Darrick J. Wong
2025-10-16 19:03 ` [PATCH 5/6] xfs: only flush when COW fork blocks overlap data fork holes Brian Foster
2025-10-16 19:03 ` [PATCH 6/6] xfs: replace zero range flush with folio batch Brian Foster
2025-11-05 22:37 ` Darrick J. Wong
2025-11-06 15:53 ` Brian Foster