linux-mm.kvack.org archive mirror
* [PATCH v3 0/7] iomap: zero range folio batch support
@ 2025-07-14 20:41 Brian Foster
  2025-07-14 20:41 ` [PATCH v3 1/7] filemap: add helper to look up dirty folios in a range Brian Foster
                   ` (6 more replies)
  0 siblings, 7 replies; 37+ messages in thread
From: Brian Foster @ 2025-07-14 20:41 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm, hch, djwong, willy

Hi all,

Quick update: this series was held up by testing work on my end. I
don't have the custom test to go along with patch 7 yet, but hch was
asking for updates, I have vacation looming, and realistically I wasn't
going to get to that beforehand. So I'm posting this version without the
additional test and reviewers can decide if/how to proceed in the
meantime. Either way, I'll pick up where this leaves off.

Zero range is still obviously functionally testable. We just don't yet
have the enhanced coverage I was hoping for via the errortag knobs.
There are also a couple of small fstests failures related to tests that
explicitly expect unwritten extents in cases where this now decides to
perform zeroing (generic/009, xfs/242). I don't consider these
functional regressions, but the tests need to be fixed up to accommodate
the new behavior. Again, I'll get back to this stuff either way; it's
just going to be at least a couple of weeks at this point. Thanks.

Brian

--- Original cover letter ---

Hi all,

Here's a first real v1 of folio batch support for iomap. This initially
only targets zero range, the use case being zeroing of dirty folios over
unwritten mappings. There is potential to support other operations in
the future: iomap seek data/hole has similar raciness issues to zero
range, the prospect of using this for buffered writes has been raised
for granular locking purposes, etc.

The one major caveat with this zero range implementation is that it
doesn't look at iomap_folio_state to determine whether to zero a
sub-folio portion of the folio. Instead it just relies on whether the
folio is dirty or not. This means that spurious zeroing of unwritten
ranges is possible if a folio is dirty but the target range includes a
sub-folio range that is not dirty.
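For example, if only one block of a large folio over an unwritten
mapping has actually been written to, the folio as a whole is dirty, so
zeroing the remaining clean blocks of that folio through this path will
now zero them in pagecache rather than skip them.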

The reasoning is that this is essentially a complexity tradeoff. The
current use cases for iomap_zero_range() are limited mostly to partial
block zeroing scenarios. It's relatively harmless to zero an unwritten
block (i.e. not a correctness issue), and this is something that
filesystems have done in the past without much notice or issue. The
advantage is less code, and it makes it a little easier to use a filemap
lookup function for the batch rather than open coding more logic in
iomap. That said, this can probably be enhanced to look at ifs in the
future if the use case expands and/or other operations justify it.

WRT testing, I've tested with and without a local hack to redirect
fallocate zero range calls to iomap_zero_range() in XFS. This helps test
beyond the partial block/folio use case, i.e. to cover boundary
conditions like full folio batch handling, etc. I recently added patch 7
in the spirit of that, which turns this logic into an XFS errortag. Further
comments on that are inline with patch 7.

Thoughts, reviews, flames appreciated.

Brian

v3:
- Update commit log description in patch 2.
- Improve comments in patch 7.
v2: https://lore.kernel.org/linux-fsdevel/20250714132059.288129-1-bfoster@redhat.com/
- Move filemap patch to top. Add some comments and drop export.
- Drop unnecessary BUG_ON()s from iomap_write_begin() instead of moving.
- Added folio mapping check to batch codepath, improved comments.
v1: https://lore.kernel.org/linux-fsdevel/20250605173357.579720-1-bfoster@redhat.com/
- Dropped most prep patches from previous version (merged separately).
- Reworked dirty folio lookup to use find_get_entry() loop (new patch
  for filemap helper).
- Misc. bug fixes, code cleanups, comments, etc.
- Added (RFC) prospective patch for wider zero range test coverage.
RFCv2: https://lore.kernel.org/linux-fsdevel/20241213150528.1003662-1-bfoster@redhat.com/
- Port onto incremental advance, drop patch 1 from RFCv1.
- Moved batch into iomap_iter, dynamically allocate and drop flag.
- Tweak XFS patch to always trim zero range on EOF boundary.
RFCv1: https://lore.kernel.org/linux-fsdevel/20241119154656.774395-1-bfoster@redhat.com/

Brian Foster (7):
  filemap: add helper to look up dirty folios in a range
  iomap: remove pos+len BUG_ON() to after folio lookup
  iomap: optional zero range dirty folio processing
  xfs: always trim mapping to requested range for zero range
  xfs: fill dirty folios on zero range of unwritten mappings
  iomap: remove old partial eof zeroing optimization
  xfs: error tag to force zeroing on debug kernels

 fs/iomap/buffered-io.c       | 116 +++++++++++++++++++++++++----------
 fs/iomap/iter.c              |   6 ++
 fs/xfs/libxfs/xfs_errortag.h |   4 +-
 fs/xfs/xfs_error.c           |   3 +
 fs/xfs/xfs_file.c            |  26 ++++++--
 fs/xfs/xfs_iomap.c           |  38 +++++++++---
 include/linux/iomap.h        |   4 ++
 include/linux/pagemap.h      |   2 +
 mm/filemap.c                 |  58 ++++++++++++++++++
 9 files changed, 210 insertions(+), 47 deletions(-)

-- 
2.50.0




* [PATCH v3 1/7] filemap: add helper to look up dirty folios in a range
  2025-07-14 20:41 [PATCH v3 0/7] iomap: zero range folio batch support Brian Foster
@ 2025-07-14 20:41 ` Brian Foster
  2025-07-15  5:20   ` Darrick J. Wong
  2025-07-14 20:41 ` [PATCH v3 2/7] iomap: remove pos+len BUG_ON() to after folio lookup Brian Foster
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 37+ messages in thread
From: Brian Foster @ 2025-07-14 20:41 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm, hch, djwong, willy

Add a new filemap_get_folios_dirty() helper to look up existing dirty
folios in a range and add them to a folio_batch. This is to support
optimization of certain iomap operations that only care about dirty
folios in a target range. For example, zero range only zeroes the subset
of dirty pages over unwritten mappings, seek hole/data may use similar
logic in the future, etc.

Note that the helper is intended for use under internal fs locks.
Therefore it trylocks folios in order to filter out clean folios.
This loosely follows the logic from filemap_range_has_writeback().
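
For reference, a minimal caller-side sketch (illustrative only; 'mapping',
'pos' and 'len' are placeholders, and the lock/recheck step is left to
the caller):

	struct folio_batch fbatch;
	pgoff_t start = pos >> PAGE_SHIFT;
	pgoff_t end = (pos + len - 1) >> PAGE_SHIFT;
	unsigned int i, nr;

	folio_batch_init(&fbatch);
	nr = filemap_get_folios_dirty(mapping, &start, end, &fbatch);
	for (i = 0; i < nr; i++) {
		struct folio *folio = fbatch.folios[i];

		folio_lock(folio);
		/* recheck dirty/writeback state under the folio lock */
		folio_unlock(folio);
	}
	folio_batch_release(&fbatch);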

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/pagemap.h |  2 ++
 mm/filemap.c            | 58 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e63fbfbd5b0f..fb83ddf26621 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -941,6 +941,8 @@ unsigned filemap_get_folios_contig(struct address_space *mapping,
 		pgoff_t *start, pgoff_t end, struct folio_batch *fbatch);
 unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t *start,
 		pgoff_t end, xa_mark_t tag, struct folio_batch *fbatch);
+unsigned filemap_get_folios_dirty(struct address_space *mapping,
+		pgoff_t *start, pgoff_t end, struct folio_batch *fbatch);
 
 /*
  * Returns locked page at given index in given cache, creating it if needed.
diff --git a/mm/filemap.c b/mm/filemap.c
index bada249b9fb7..2171b7f689b0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2334,6 +2334,64 @@ unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t *start,
 }
 EXPORT_SYMBOL(filemap_get_folios_tag);
 
+/**
+ * filemap_get_folios_dirty - Get a batch of dirty folios
+ * @mapping:	The address_space to search
+ * @start:	The starting folio index
+ * @end:	The final folio index (inclusive)
+ * @fbatch:	The batch to fill
+ *
+ * filemap_get_folios_dirty() works exactly like filemap_get_folios(), except
+ * the returned folios are presumed to be dirty or undergoing writeback. Dirty
+ * state is presumed because we don't block on folio lock nor want to miss
+ * folios. Callers that need to can recheck state upon locking the folio.
+ *
+ * This may not return all dirty folios if the batch gets filled up.
+ *
+ * Return: The number of folios found.
+ * Also update @start to be positioned for traversal of the next folio.
+ */
+unsigned filemap_get_folios_dirty(struct address_space *mapping, pgoff_t *start,
+			pgoff_t end, struct folio_batch *fbatch)
+{
+	XA_STATE(xas, &mapping->i_pages, *start);
+	struct folio *folio;
+
+	rcu_read_lock();
+	while ((folio = find_get_entry(&xas, end, XA_PRESENT)) != NULL) {
+		if (xa_is_value(folio))
+			continue;
+		if (folio_trylock(folio)) {
+			bool clean = !folio_test_dirty(folio) &&
+				     !folio_test_writeback(folio);
+			folio_unlock(folio);
+			if (clean) {
+				folio_put(folio);
+				continue;
+			}
+		}
+		if (!folio_batch_add(fbatch, folio)) {
+			unsigned long nr = folio_nr_pages(folio);
+			*start = folio->index + nr;
+			goto out;
+		}
+	}
+	/*
+	 * We come here when there is no folio beyond @end. We take care to not
+	 * overflow the index @start as it confuses some of the callers. This
+	 * breaks the iteration when there is a folio at index -1 but that is
+	 * already broken anyway.
+	 */
+	if (end == (pgoff_t)-1)
+		*start = (pgoff_t)-1;
+	else
+		*start = end + 1;
+out:
+	rcu_read_unlock();
+
+	return folio_batch_count(fbatch);
+}
+
 /*
  * CD/DVDs are error prone. When a medium error occurs, the driver may fail
  * a _large_ part of the i/o request. Imagine the worst scenario:
-- 
2.50.0




* [PATCH v3 2/7] iomap: remove pos+len BUG_ON() to after folio lookup
  2025-07-14 20:41 [PATCH v3 0/7] iomap: zero range folio batch support Brian Foster
  2025-07-14 20:41 ` [PATCH v3 1/7] filemap: add helper to look up dirty folios in a range Brian Foster
@ 2025-07-14 20:41 ` Brian Foster
  2025-07-15  5:14   ` Darrick J. Wong
  2025-07-14 20:41 ` [PATCH v3 3/7] iomap: optional zero range dirty folio processing Brian Foster
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 37+ messages in thread
From: Brian Foster @ 2025-07-14 20:41 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm, hch, djwong, willy

The bug checks at the top of iomap_write_begin() assume the pos/len
reflect exactly the next range to process. This may no longer be the
case once the get folio path is able to process a folio batch from
the filesystem. On top of that, len is already trimmed to within the
iomap/srcmap by iomap_length(), so these checks aren't terribly
useful. Remove the unnecessary BUG_ON() checks.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 3729391a18f3..38da2fa6e6b0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -805,15 +805,12 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
 {
 	const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
-	loff_t pos = iter->pos;
+	loff_t pos;
 	u64 len = min_t(u64, SIZE_MAX, iomap_length(iter));
 	struct folio *folio;
 	int status = 0;
 
 	len = min_not_zero(len, *plen);
-	BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
-	if (srcmap != &iter->iomap)
-		BUG_ON(pos + len > srcmap->offset + srcmap->length);
 
 	if (fatal_signal_pending(current))
 		return -EINTR;
-- 
2.50.0




* [PATCH v3 3/7] iomap: optional zero range dirty folio processing
  2025-07-14 20:41 [PATCH v3 0/7] iomap: zero range folio batch support Brian Foster
  2025-07-14 20:41 ` [PATCH v3 1/7] filemap: add helper to look up dirty folios in a range Brian Foster
  2025-07-14 20:41 ` [PATCH v3 2/7] iomap: remove pos+len BUG_ON() to after folio lookup Brian Foster
@ 2025-07-14 20:41 ` Brian Foster
  2025-07-15  5:22   ` Darrick J. Wong
  2025-07-14 20:41 ` [PATCH v3 4/7] xfs: always trim mapping to requested range for zero range Brian Foster
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 37+ messages in thread
From: Brian Foster @ 2025-07-14 20:41 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm, hch, djwong, willy

The only way zero range can currently process unwritten mappings
with dirty pagecache is to check whether the range is dirty before
mapping lookup and then flush when at least one underlying mapping
is unwritten. This ordering is required to prevent iomap lookup from
racing with folio writeback and reclaim.

Since zero range can skip ranges of unwritten mappings that are
clean in cache, this operation can be improved by allowing the
filesystem to provide a set of dirty folios that require zeroing. In
turn, rather than flush or iterate file offsets, zero range can
iterate on folios in the batch and advance over clean or uncached
ranges in between.

Add a folio_batch in struct iomap and provide a helper for fs' to
populate the batch at lookup time. Update the folio lookup path to
return the next folio in the batch, if provided, and advance the
iter if the folio starts beyond the current offset.
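
As a rough sketch of how a filesystem might wire this up from its
->iomap_begin handler on the zero range path (illustrative only; apart
from iomap_fill_dirty_folios() the names here are made up, and the real
XFS hookup comes later in the series):

	static int foo_zero_iomap_begin(struct inode *inode, loff_t offset,
			loff_t length, unsigned flags, struct iomap *iomap,
			struct iomap *srcmap)
	{
		struct iomap_iter *iter =
			container_of(iomap, struct iomap_iter, iomap);

		/* ... normal extent lookup fills *iomap here ... */

		if ((flags & IOMAP_ZERO) && iomap->type == IOMAP_UNWRITTEN) {
			loff_t len = min_t(loff_t, length,
					iomap->offset + iomap->length - offset);
			loff_t end;

			/*
			 * Populate the batch with dirty folios over the
			 * unwritten mapping, then trim the mapping to the
			 * end of the lookup in case the batch filled up
			 * before the end of the range.
			 */
			end = iomap_fill_dirty_folios(iter, offset, len);
			iomap->length = min_t(u64, iomap->length,
					      end - iomap->offset);
		}
		return 0;
	}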

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++---
 fs/iomap/iter.c        |  6 +++
 include/linux/iomap.h  |  4 ++
 3 files changed, 94 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 38da2fa6e6b0..194e3cc0857f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -750,6 +750,28 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter, size_t len)
 	if (!mapping_large_folio_support(iter->inode->i_mapping))
 		len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
 
+	if (iter->fbatch) {
+		struct folio *folio = folio_batch_next(iter->fbatch);
+
+		if (!folio)
+			return NULL;
+
+		/*
+		 * The folio mapping generally shouldn't have changed based on
+		 * fs locks, but be consistent with filemap lookup and retry
+		 * the iter if it does.
+		 */
+		folio_lock(folio);
+		if (unlikely(folio->mapping != iter->inode->i_mapping)) {
+			iter->iomap.flags |= IOMAP_F_STALE;
+			folio_unlock(folio);
+			return NULL;
+		}
+
+		folio_get(folio);
+		return folio;
+	}
+
 	if (folio_ops && folio_ops->get_folio)
 		return folio_ops->get_folio(iter, pos, len);
 	else
@@ -811,6 +833,8 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
 	int status = 0;
 
 	len = min_not_zero(len, *plen);
+	*foliop = NULL;
+	*plen = 0;
 
 	if (fatal_signal_pending(current))
 		return -EINTR;
@@ -819,6 +843,15 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
 	if (IS_ERR(folio))
 		return PTR_ERR(folio);
 
+	/*
+	 * No folio means we're done with a batch. We still have range to
+	 * process so return and let the caller iterate and refill the batch.
+	 */
+	if (!folio) {
+		WARN_ON_ONCE(!iter->fbatch);
+		return 0;
+	}
+
 	/*
 	 * Now we have a locked folio, before we do anything with it we need to
 	 * check that the iomap we have cached is not stale. The inode extent
@@ -839,6 +872,21 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
 		}
 	}
 
+	/*
+	 * The folios in a batch may not be contiguous. If we've skipped
+	 * forward, advance the iter to the pos of the current folio. If the
+	 * folio starts beyond the end of the mapping, it may have been trimmed
+	 * since the lookup for whatever reason. Return a NULL folio to
+	 * terminate the op.
+	 */
+	if (folio_pos(folio) > iter->pos) {
+		len = min_t(u64, folio_pos(folio) - iter->pos,
+				 iomap_length(iter));
+		status = iomap_iter_advance(iter, &len);
+		if (status || !len)
+			goto out_unlock;
+	}
+
 	pos = iomap_trim_folio_range(iter, folio, poffset, &len);
 
 	if (srcmap->type == IOMAP_INLINE)
@@ -1377,6 +1425,12 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (iter->iomap.flags & IOMAP_F_STALE)
 			break;
 
+		/* a NULL folio means we're done with a folio batch */
+		if (!folio) {
+			status = iomap_iter_advance_full(iter);
+			break;
+		}
+
 		/* warn about zeroing folios beyond eof that won't write back */
 		WARN_ON_ONCE(folio_pos(folio) > iter->inode->i_size);
 
@@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 	return status;
 }
 
+loff_t
+iomap_fill_dirty_folios(
+	struct iomap_iter	*iter,
+	loff_t			offset,
+	loff_t			length)
+{
+	struct address_space	*mapping = iter->inode->i_mapping;
+	pgoff_t			start = offset >> PAGE_SHIFT;
+	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;
+
+	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
+	if (!iter->fbatch)
+		return offset + length;
+	folio_batch_init(iter->fbatch);
+
+	filemap_get_folios_dirty(mapping, &start, end, iter->fbatch);
+	return (start << PAGE_SHIFT);
+}
+EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios);
+
 int
 iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 		const struct iomap_ops *ops, void *private)
@@ -1426,7 +1500,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 	 * flushing on partial eof zeroing, special case it to zero the
 	 * unaligned start portion if already dirty in pagecache.
 	 */
-	if (off &&
+	if (!iter.fbatch && off &&
 	    filemap_range_needs_writeback(mapping, pos, pos + plen - 1)) {
 		iter.len = plen;
 		while ((ret = iomap_iter(&iter, ops)) > 0)
@@ -1442,13 +1516,18 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 	 * if dirty and the fs returns a mapping that might convert on
 	 * writeback.
 	 */
-	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
-					iter.pos, iter.pos + iter.len - 1);
+	range_dirty = filemap_range_needs_writeback(mapping, iter.pos,
+					iter.pos + iter.len - 1);
 	while ((ret = iomap_iter(&iter, ops)) > 0) {
 		const struct iomap *srcmap = iomap_iter_srcmap(&iter);
 
-		if (srcmap->type == IOMAP_HOLE ||
-		    srcmap->type == IOMAP_UNWRITTEN) {
+		if (WARN_ON_ONCE(iter.fbatch &&
+				 srcmap->type != IOMAP_UNWRITTEN))
+			return -EIO;
+
+		if (!iter.fbatch &&
+		    (srcmap->type == IOMAP_HOLE ||
+		     srcmap->type == IOMAP_UNWRITTEN)) {
 			s64 status;
 
 			if (range_dirty) {
diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
index 6ffc6a7b9ba5..89bd5951a6fd 100644
--- a/fs/iomap/iter.c
+++ b/fs/iomap/iter.c
@@ -9,6 +9,12 @@
 
 static inline void iomap_iter_reset_iomap(struct iomap_iter *iter)
 {
+	if (iter->fbatch) {
+		folio_batch_release(iter->fbatch);
+		kfree(iter->fbatch);
+		iter->fbatch = NULL;
+	}
+
 	iter->status = 0;
 	memset(&iter->iomap, 0, sizeof(iter->iomap));
 	memset(&iter->srcmap, 0, sizeof(iter->srcmap));
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 522644d62f30..0b9b460b2873 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -9,6 +9,7 @@
 #include <linux/types.h>
 #include <linux/mm_types.h>
 #include <linux/blkdev.h>
+#include <linux/pagevec.h>
 
 struct address_space;
 struct fiemap_extent_info;
@@ -239,6 +240,7 @@ struct iomap_iter {
 	unsigned flags;
 	struct iomap iomap;
 	struct iomap srcmap;
+	struct folio_batch *fbatch;
 	void *private;
 };
 
@@ -345,6 +347,8 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
 bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
+loff_t iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t offset,
+		loff_t length);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
 		bool *did_zero, const struct iomap_ops *ops, void *private);
 int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-- 
2.50.0




* [PATCH v3 4/7] xfs: always trim mapping to requested range for zero range
  2025-07-14 20:41 [PATCH v3 0/7] iomap: zero range folio batch support Brian Foster
                   ` (2 preceding siblings ...)
  2025-07-14 20:41 ` [PATCH v3 3/7] iomap: optional zero range dirty folio processing Brian Foster
@ 2025-07-14 20:41 ` Brian Foster
  2025-07-14 20:41 ` [PATCH v3 5/7] xfs: fill dirty folios on zero range of unwritten mappings Brian Foster
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 37+ messages in thread
From: Brian Foster @ 2025-07-14 20:41 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm, hch, djwong, willy

Refactor and tweak the IOMAP_ZERO logic in preparation to support
filling the folio batch for unwritten mappings. Drop the superfluous
imap offset check since the hole case has already been filtered out.
Split the delalloc case handling into a sub-branch, and always
trim the imap to the requested offset/count so it can be more easily
used to bound the range to look up in pagecache.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_iomap.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index ff05e6b1b0bb..b5cf5bc6308d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1756,21 +1756,20 @@ xfs_buffered_write_iomap_begin(
 	}
 
 	/*
-	 * For zeroing, trim a delalloc extent that extends beyond the EOF
-	 * block.  If it starts beyond the EOF block, convert it to an
+	 * For zeroing, trim extents that extend beyond the EOF block. If a
+	 * delalloc extent starts beyond the EOF block, convert it to an
 	 * unwritten extent.
 	 */
-	if ((flags & IOMAP_ZERO) && imap.br_startoff <= offset_fsb &&
-	    isnullstartblock(imap.br_startblock)) {
+	if (flags & IOMAP_ZERO) {
 		xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
 
-		if (offset_fsb >= eof_fsb)
+		if (isnullstartblock(imap.br_startblock) &&
+		    offset_fsb >= eof_fsb)
 			goto convert_delay;
-		if (end_fsb > eof_fsb) {
+		if (offset_fsb < eof_fsb && end_fsb > eof_fsb)
 			end_fsb = eof_fsb;
-			xfs_trim_extent(&imap, offset_fsb,
-					end_fsb - offset_fsb);
-		}
+
+		xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
 	}
 
 	/*
-- 
2.50.0




* [PATCH v3 5/7] xfs: fill dirty folios on zero range of unwritten mappings
  2025-07-14 20:41 [PATCH v3 0/7] iomap: zero range folio batch support Brian Foster
                   ` (3 preceding siblings ...)
  2025-07-14 20:41 ` [PATCH v3 4/7] xfs: always trim mapping to requested range for zero range Brian Foster
@ 2025-07-14 20:41 ` Brian Foster
  2025-07-15  5:28   ` Darrick J. Wong
  2025-07-14 20:41 ` [PATCH v3 6/7] iomap: remove old partial eof zeroing optimization Brian Foster
  2025-07-14 20:41 ` [PATCH v3 7/7] xfs: error tag to force zeroing on debug kernels Brian Foster
  6 siblings, 1 reply; 37+ messages in thread
From: Brian Foster @ 2025-07-14 20:41 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm, hch, djwong, willy

Use the iomap folio batch mechanism to select folios to zero on zero
range of unwritten mappings. Trim the resulting mapping if the batch
is filled (unlikely for current use cases) to distinguish between a
range to skip and one that requires another iteration due to a full
batch.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_iomap.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index b5cf5bc6308d..63054f7ead0e 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1691,6 +1691,8 @@ xfs_buffered_write_iomap_begin(
 	struct iomap		*iomap,
 	struct iomap		*srcmap)
 {
+	struct iomap_iter	*iter = container_of(iomap, struct iomap_iter,
+						     iomap);
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
@@ -1762,6 +1764,7 @@ xfs_buffered_write_iomap_begin(
 	 */
 	if (flags & IOMAP_ZERO) {
 		xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
+		u64 end;
 
 		if (isnullstartblock(imap.br_startblock) &&
 		    offset_fsb >= eof_fsb)
@@ -1769,6 +1772,26 @@ xfs_buffered_write_iomap_begin(
 		if (offset_fsb < eof_fsb && end_fsb > eof_fsb)
 			end_fsb = eof_fsb;
 
+		/*
+		 * Look up dirty folios for unwritten mappings within EOF.
+		 * Providing this bypasses the flush iomap uses to trigger
+		 * extent conversion when unwritten mappings have dirty
+		 * pagecache in need of zeroing.
+		 *
+		 * Trim the mapping to the end pos of the lookup, which in turn
+		 * was trimmed to the end of the batch if it became full before
+		 * the end of the mapping.
+		 */
+		if (imap.br_state == XFS_EXT_UNWRITTEN &&
+		    offset_fsb < eof_fsb) {
+			loff_t len = min(count,
+					 XFS_FSB_TO_B(mp, imap.br_blockcount));
+
+			end = iomap_fill_dirty_folios(iter, offset, len);
+			end_fsb = min_t(xfs_fileoff_t, end_fsb,
+					XFS_B_TO_FSB(mp, end));
+		}
+
 		xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
 	}
 
-- 
2.50.0




* [PATCH v3 6/7] iomap: remove old partial eof zeroing optimization
  2025-07-14 20:41 [PATCH v3 0/7] iomap: zero range folio batch support Brian Foster
                   ` (4 preceding siblings ...)
  2025-07-14 20:41 ` [PATCH v3 5/7] xfs: fill dirty folios on zero range of unwritten mappings Brian Foster
@ 2025-07-14 20:41 ` Brian Foster
  2025-07-15  5:34   ` Darrick J. Wong
  2025-07-14 20:41 ` [PATCH v3 7/7] xfs: error tag to force zeroing on debug kernels Brian Foster
  6 siblings, 1 reply; 37+ messages in thread
From: Brian Foster @ 2025-07-14 20:41 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm, hch, djwong, willy

iomap_zero_range() optimizes the partial eof block zeroing use case
by force zeroing if the mapping is dirty. This is to avoid frequent
flushing on file extending workloads, which hurts performance.

Now that the folio batch mechanism provides a more generic solution
and is used by the only real zero range user (XFS), this isolated
optimization is no longer needed. Remove the unnecessary code and
let callers use the folio batch or fall back to flushing by default.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 24 ------------------------
 1 file changed, 24 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 194e3cc0857f..d2bbed692c06 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1484,33 +1484,9 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 		.private	= private,
 	};
 	struct address_space *mapping = inode->i_mapping;
-	unsigned int blocksize = i_blocksize(inode);
-	unsigned int off = pos & (blocksize - 1);
-	loff_t plen = min_t(loff_t, len, blocksize - off);
 	int ret;
 	bool range_dirty;
 
-	/*
-	 * Zero range can skip mappings that are zero on disk so long as
-	 * pagecache is clean. If pagecache was dirty prior to zero range, the
-	 * mapping converts on writeback completion and so must be zeroed.
-	 *
-	 * The simplest way to deal with this across a range is to flush
-	 * pagecache and process the updated mappings. To avoid excessive
-	 * flushing on partial eof zeroing, special case it to zero the
-	 * unaligned start portion if already dirty in pagecache.
-	 */
-	if (!iter.fbatch && off &&
-	    filemap_range_needs_writeback(mapping, pos, pos + plen - 1)) {
-		iter.len = plen;
-		while ((ret = iomap_iter(&iter, ops)) > 0)
-			iter.status = iomap_zero_iter(&iter, did_zero);
-
-		iter.len = len - (iter.pos - pos);
-		if (ret || !iter.len)
-			return ret;
-	}
-
 	/*
 	 * To avoid an unconditional flush, check pagecache state and only flush
 	 * if dirty and the fs returns a mapping that might convert on
-- 
2.50.0




* [PATCH v3 7/7] xfs: error tag to force zeroing on debug kernels
  2025-07-14 20:41 [PATCH v3 0/7] iomap: zero range folio batch support Brian Foster
                   ` (5 preceding siblings ...)
  2025-07-14 20:41 ` [PATCH v3 6/7] iomap: remove old partial eof zeroing optimization Brian Foster
@ 2025-07-14 20:41 ` Brian Foster
  2025-07-15  5:24   ` Darrick J. Wong
  6 siblings, 1 reply; 37+ messages in thread
From: Brian Foster @ 2025-07-14 20:41 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm, hch, djwong, willy

iomap_zero_range() has to cover various corner cases that are
difficult to test on production kernels because it is used in fairly
limited use cases. For example, it is currently only used by XFS and
mostly only in partial block zeroing cases.

While it's possible to test most of these functional cases, we can
provide more robust test coverage by co-opting fallocate zero range
to invoke zeroing of the entire range instead of the more efficient
block punch/allocate sequence. Add an errortag to occasionally
invoke forced zeroing.
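On DEBUG kernels the tag should then be tunable like any other errortag,
presumably via a force_zero_range knob under the per-mount errortag
sysfs directory.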

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_errortag.h |  4 +++-
 fs/xfs/xfs_error.c           |  3 +++
 fs/xfs/xfs_file.c            | 26 ++++++++++++++++++++------
 3 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index a53c5d40e084..33ca3fc2ca88 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -65,7 +65,8 @@
 #define XFS_ERRTAG_WRITE_DELAY_MS			43
 #define XFS_ERRTAG_EXCHMAPS_FINISH_ONE			44
 #define XFS_ERRTAG_METAFILE_RESV_CRITICAL		45
-#define XFS_ERRTAG_MAX					46
+#define XFS_ERRTAG_FORCE_ZERO_RANGE			46
+#define XFS_ERRTAG_MAX					47
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -115,5 +116,6 @@
 #define XFS_RANDOM_WRITE_DELAY_MS			3000
 #define XFS_RANDOM_EXCHMAPS_FINISH_ONE			1
 #define XFS_RANDOM_METAFILE_RESV_CRITICAL		4
+#define XFS_RANDOM_FORCE_ZERO_RANGE			4
 
 #endif /* __XFS_ERRORTAG_H_ */
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index dbd87e137694..00c0c391c329 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -64,6 +64,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_WRITE_DELAY_MS,
 	XFS_RANDOM_EXCHMAPS_FINISH_ONE,
 	XFS_RANDOM_METAFILE_RESV_CRITICAL,
+	XFS_RANDOM_FORCE_ZERO_RANGE,
 };
 
 struct xfs_errortag_attr {
@@ -183,6 +184,7 @@ XFS_ERRORTAG_ATTR_RW(wb_delay_ms,	XFS_ERRTAG_WB_DELAY_MS);
 XFS_ERRORTAG_ATTR_RW(write_delay_ms,	XFS_ERRTAG_WRITE_DELAY_MS);
 XFS_ERRORTAG_ATTR_RW(exchmaps_finish_one, XFS_ERRTAG_EXCHMAPS_FINISH_ONE);
 XFS_ERRORTAG_ATTR_RW(metafile_resv_crit, XFS_ERRTAG_METAFILE_RESV_CRITICAL);
+XFS_ERRORTAG_ATTR_RW(force_zero_range, XFS_ERRTAG_FORCE_ZERO_RANGE);
 
 static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -230,6 +232,7 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(write_delay_ms),
 	XFS_ERRORTAG_ATTR_LIST(exchmaps_finish_one),
 	XFS_ERRORTAG_ATTR_LIST(metafile_resv_crit),
+	XFS_ERRORTAG_ATTR_LIST(force_zero_range),
 	NULL,
 };
 ATTRIBUTE_GROUPS(xfs_errortag);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 0b41b18debf3..c865f9555b77 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -27,6 +27,8 @@
 #include "xfs_file.h"
 #include "xfs_aops.h"
 #include "xfs_zone_alloc.h"
+#include "xfs_error.h"
+#include "xfs_errortag.h"
 
 #include <linux/dax.h>
 #include <linux/falloc.h>
@@ -1269,13 +1271,25 @@ xfs_falloc_zero_range(
 	if (error)
 		return error;
 
-	error = xfs_free_file_space(XFS_I(inode), offset, len, ac);
-	if (error)
-		return error;
+	/*
+	 * Zero range implements a full zeroing mechanism but is only used in
+	 * limited situations. It is more efficient to allocate unwritten
+	 * extents than to perform zeroing here, so use an errortag to randomly
+	 * force zeroing on DEBUG kernels for added test coverage.
+	 */
+	if (XFS_TEST_ERROR(false, XFS_I(inode)->i_mount,
+			   XFS_ERRTAG_FORCE_ZERO_RANGE)) {
+		error = xfs_zero_range(XFS_I(inode), offset, len, ac, NULL);
+	} else {
+		error = xfs_free_file_space(XFS_I(inode), offset, len, ac);
+		if (error)
+			return error;
 
-	len = round_up(offset + len, blksize) - round_down(offset, blksize);
-	offset = round_down(offset, blksize);
-	error = xfs_alloc_file_space(XFS_I(inode), offset, len);
+		len = round_up(offset + len, blksize) -
+			round_down(offset, blksize);
+		offset = round_down(offset, blksize);
+		error = xfs_alloc_file_space(XFS_I(inode), offset, len);
+	}
 	if (error)
 		return error;
 	return xfs_falloc_setsize(file, new_size);
-- 
2.50.0




* Re: [PATCH v3 2/7] iomap: remove pos+len BUG_ON() to after folio lookup
  2025-07-14 20:41 ` [PATCH v3 2/7] iomap: remove pos+len BUG_ON() to after folio lookup Brian Foster
@ 2025-07-15  5:14   ` Darrick J. Wong
  0 siblings, 0 replies; 37+ messages in thread
From: Darrick J. Wong @ 2025-07-15  5:14 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy

On Mon, Jul 14, 2025 at 04:41:17PM -0400, Brian Foster wrote:
> The bug checks at the top of iomap_write_begin() assume the pos/len
> reflect exactly the next range to process. This may no longer be the
> case once the get folio path is able to process a folio batch from
> the filesystem. On top of that, len is already trimmed to within the
> iomap/srcmap by iomap_length(), so these checks aren't terribly
> useful. Remove the unnecessary BUG_ON() checks.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Heh, glad this went away
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> ---
>  fs/iomap/buffered-io.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 3729391a18f3..38da2fa6e6b0 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -805,15 +805,12 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
>  {
>  	const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> -	loff_t pos = iter->pos;
> +	loff_t pos;
>  	u64 len = min_t(u64, SIZE_MAX, iomap_length(iter));
>  	struct folio *folio;
>  	int status = 0;
>  
>  	len = min_not_zero(len, *plen);
> -	BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
> -	if (srcmap != &iter->iomap)
> -		BUG_ON(pos + len > srcmap->offset + srcmap->length);
>  
>  	if (fatal_signal_pending(current))
>  		return -EINTR;
> -- 
> 2.50.0
> 
> 



* Re: [PATCH v3 1/7] filemap: add helper to look up dirty folios in a range
  2025-07-14 20:41 ` [PATCH v3 1/7] filemap: add helper to look up dirty folios in a range Brian Foster
@ 2025-07-15  5:20   ` Darrick J. Wong
  0 siblings, 0 replies; 37+ messages in thread
From: Darrick J. Wong @ 2025-07-15  5:20 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy

On Mon, Jul 14, 2025 at 04:41:16PM -0400, Brian Foster wrote:
> Add a new filemap_get_folios_dirty() helper to look up existing dirty
> folios in a range and add them to a folio_batch. This is to support
> optimization of certain iomap operations that only care about dirty
> folios in a target range. For example, zero range only zeroes the subset
> of dirty pages over unwritten mappings, seek hole/data may use similar
> logic in the future, etc.
> 
> Note that the helper is intended for use under internal fs locks.
> Therefore it trylocks folios in order to filter out clean folios.
> This loosely follows the logic from filemap_range_has_writeback().
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

This seems correct to me, though like hch said, I'd like to hear from
willy.
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> ---
>  include/linux/pagemap.h |  2 ++
>  mm/filemap.c            | 58 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 60 insertions(+)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index e63fbfbd5b0f..fb83ddf26621 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -941,6 +941,8 @@ unsigned filemap_get_folios_contig(struct address_space *mapping,
>  		pgoff_t *start, pgoff_t end, struct folio_batch *fbatch);
>  unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t *start,
>  		pgoff_t end, xa_mark_t tag, struct folio_batch *fbatch);
> +unsigned filemap_get_folios_dirty(struct address_space *mapping,
> +		pgoff_t *start, pgoff_t end, struct folio_batch *fbatch);
>  
>  /*
>   * Returns locked page at given index in given cache, creating it if needed.
> diff --git a/mm/filemap.c b/mm/filemap.c
> index bada249b9fb7..2171b7f689b0 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2334,6 +2334,64 @@ unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t *start,
>  }
>  EXPORT_SYMBOL(filemap_get_folios_tag);
>  
> +/**
> + * filemap_get_folios_dirty - Get a batch of dirty folios
> + * @mapping:	The address_space to search
> + * @start:	The starting folio index
> + * @end:	The final folio index (inclusive)
> + * @fbatch:	The batch to fill
> + *
> + * filemap_get_folios_dirty() works exactly like filemap_get_folios(), except
> + * the returned folios are presumed to be dirty or undergoing writeback. Dirty
> + * state is presumed because we don't block on folio lock nor want to miss
> + * folios. Callers that need to can recheck state upon locking the folio.
> + *
> + * This may not return all dirty folios if the batch gets filled up.
> + *
> + * Return: The number of folios found.
> + * Also update @start to be positioned for traversal of the next folio.
> + */
> +unsigned filemap_get_folios_dirty(struct address_space *mapping, pgoff_t *start,
> +			pgoff_t end, struct folio_batch *fbatch)
> +{
> +	XA_STATE(xas, &mapping->i_pages, *start);
> +	struct folio *folio;
> +
> +	rcu_read_lock();
> +	while ((folio = find_get_entry(&xas, end, XA_PRESENT)) != NULL) {
> +		if (xa_is_value(folio))
> +			continue;
> +		if (folio_trylock(folio)) {
> +			bool clean = !folio_test_dirty(folio) &&
> +				     !folio_test_writeback(folio);
> +			folio_unlock(folio);
> +			if (clean) {
> +				folio_put(folio);
> +				continue;
> +			}
> +		}
> +		if (!folio_batch_add(fbatch, folio)) {
> +			unsigned long nr = folio_nr_pages(folio);
> +			*start = folio->index + nr;
> +			goto out;
> +		}
> +	}
> +	/*
> +	 * We come here when there is no folio beyond @end. We take care to not
> +	 * overflow the index @start as it confuses some of the callers. This
> +	 * breaks the iteration when there is a folio at index -1 but that is
> +	 * already broken anyway.
> +	 */
> +	if (end == (pgoff_t)-1)
> +		*start = (pgoff_t)-1;
> +	else
> +		*start = end + 1;
> +out:
> +	rcu_read_unlock();
> +
> +	return folio_batch_count(fbatch);
> +}
> +
>  /*
>   * CD/DVDs are error prone. When a medium error occurs, the driver may fail
>   * a _large_ part of the i/o request. Imagine the worst scenario:
> -- 
> 2.50.0
> 
> 



* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing
  2025-07-14 20:41 ` [PATCH v3 3/7] iomap: optional zero range dirty folio processing Brian Foster
@ 2025-07-15  5:22   ` Darrick J. Wong
  2025-07-15 12:35     ` Brian Foster
  2025-07-18 11:30     ` Zhang Yi
  0 siblings, 2 replies; 37+ messages in thread
From: Darrick J. Wong @ 2025-07-15  5:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy

On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote:
> The only way zero range can currently process unwritten mappings
> with dirty pagecache is to check whether the range is dirty before
> mapping lookup and then flush when at least one underlying mapping
> is unwritten. This ordering is required to prevent iomap lookup from
> racing with folio writeback and reclaim.
> 
> Since zero range can skip ranges of unwritten mappings that are
> clean in cache, this operation can be improved by allowing the
> filesystem to provide a set of dirty folios that require zeroing. In
> turn, rather than flush or iterate file offsets, zero range can
> iterate on folios in the batch and advance over clean or uncached
> ranges in between.
> 
> Add a folio_batch in struct iomap and provide a helper for fs' to

/me confused by the single quote; is this supposed to read:

"...for the fs to populate..."?

Either way the code changes look like a reasonable thing to do for the
pagecache (try to grab a bunch of dirty folios while XFS holds the
mapping lock) so

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D


> populate the batch at lookup time. Update the folio lookup path to
> return the next folio in the batch, if provided, and advance the
> iter if the folio starts beyond the current offset.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++---
>  fs/iomap/iter.c        |  6 +++
>  include/linux/iomap.h  |  4 ++
>  3 files changed, 94 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 38da2fa6e6b0..194e3cc0857f 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -750,6 +750,28 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter, size_t len)
>  	if (!mapping_large_folio_support(iter->inode->i_mapping))
>  		len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
>  
> +	if (iter->fbatch) {
> +		struct folio *folio = folio_batch_next(iter->fbatch);
> +
> +		if (!folio)
> +			return NULL;
> +
> +		/*
> +		 * The folio mapping generally shouldn't have changed based on
> +		 * fs locks, but be consistent with filemap lookup and retry
> +		 * the iter if it does.
> +		 */
> +		folio_lock(folio);
> +		if (unlikely(folio->mapping != iter->inode->i_mapping)) {
> +			iter->iomap.flags |= IOMAP_F_STALE;
> +			folio_unlock(folio);
> +			return NULL;
> +		}
> +
> +		folio_get(folio);
> +		return folio;
> +	}
> +
>  	if (folio_ops && folio_ops->get_folio)
>  		return folio_ops->get_folio(iter, pos, len);
>  	else
> @@ -811,6 +833,8 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
>  	int status = 0;
>  
>  	len = min_not_zero(len, *plen);
> +	*foliop = NULL;
> +	*plen = 0;
>  
>  	if (fatal_signal_pending(current))
>  		return -EINTR;
> @@ -819,6 +843,15 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
>  	if (IS_ERR(folio))
>  		return PTR_ERR(folio);
>  
> +	/*
> +	 * No folio means we're done with a batch. We still have range to
> +	 * process so return and let the caller iterate and refill the batch.
> +	 */
> +	if (!folio) {
> +		WARN_ON_ONCE(!iter->fbatch);
> +		return 0;
> +	}
> +
>  	/*
>  	 * Now we have a locked folio, before we do anything with it we need to
>  	 * check that the iomap we have cached is not stale. The inode extent
> @@ -839,6 +872,21 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
>  		}
>  	}
>  
> +	/*
> +	 * The folios in a batch may not be contiguous. If we've skipped
> +	 * forward, advance the iter to the pos of the current folio. If the
> +	 * folio starts beyond the end of the mapping, it may have been trimmed
> +	 * since the lookup for whatever reason. Return a NULL folio to
> +	 * terminate the op.
> +	 */
> +	if (folio_pos(folio) > iter->pos) {
> +		len = min_t(u64, folio_pos(folio) - iter->pos,
> +				 iomap_length(iter));
> +		status = iomap_iter_advance(iter, &len);
> +		if (status || !len)
> +			goto out_unlock;
> +	}
> +
>  	pos = iomap_trim_folio_range(iter, folio, poffset, &len);
>  
>  	if (srcmap->type == IOMAP_INLINE)
> @@ -1377,6 +1425,12 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		if (iter->iomap.flags & IOMAP_F_STALE)
>  			break;
>  
> +		/* a NULL folio means we're done with a folio batch */
> +		if (!folio) {
> +			status = iomap_iter_advance_full(iter);
> +			break;
> +		}
> +
>  		/* warn about zeroing folios beyond eof that won't write back */
>  		WARN_ON_ONCE(folio_pos(folio) > iter->inode->i_size);
>  
> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  	return status;
>  }
>  
> +loff_t
> +iomap_fill_dirty_folios(
> +	struct iomap_iter	*iter,
> +	loff_t			offset,
> +	loff_t			length)
> +{
> +	struct address_space	*mapping = iter->inode->i_mapping;
> +	pgoff_t			start = offset >> PAGE_SHIFT;
> +	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;
> +
> +	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
> +	if (!iter->fbatch)
> +		return offset + length;
> +	folio_batch_init(iter->fbatch);
> +
> +	filemap_get_folios_dirty(mapping, &start, end, iter->fbatch);
> +	return (start << PAGE_SHIFT);
> +}
> +EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios);
> +
>  int
>  iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>  		const struct iomap_ops *ops, void *private)
> @@ -1426,7 +1500,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>  	 * flushing on partial eof zeroing, special case it to zero the
>  	 * unaligned start portion if already dirty in pagecache.
>  	 */
> -	if (off &&
> +	if (!iter.fbatch && off &&
>  	    filemap_range_needs_writeback(mapping, pos, pos + plen - 1)) {
>  		iter.len = plen;
>  		while ((ret = iomap_iter(&iter, ops)) > 0)
> @@ -1442,13 +1516,18 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>  	 * if dirty and the fs returns a mapping that might convert on
>  	 * writeback.
>  	 */
> -	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
> -					iter.pos, iter.pos + iter.len - 1);
> +	range_dirty = filemap_range_needs_writeback(mapping, iter.pos,
> +					iter.pos + iter.len - 1);
>  	while ((ret = iomap_iter(&iter, ops)) > 0) {
>  		const struct iomap *srcmap = iomap_iter_srcmap(&iter);
>  
> -		if (srcmap->type == IOMAP_HOLE ||
> -		    srcmap->type == IOMAP_UNWRITTEN) {
> +		if (WARN_ON_ONCE(iter.fbatch &&
> +				 srcmap->type != IOMAP_UNWRITTEN))
> +			return -EIO;
> +
> +		if (!iter.fbatch &&
> +		    (srcmap->type == IOMAP_HOLE ||
> +		     srcmap->type == IOMAP_UNWRITTEN)) {
>  			s64 status;
>  
>  			if (range_dirty) {
> diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
> index 6ffc6a7b9ba5..89bd5951a6fd 100644
> --- a/fs/iomap/iter.c
> +++ b/fs/iomap/iter.c
> @@ -9,6 +9,12 @@
>  
>  static inline void iomap_iter_reset_iomap(struct iomap_iter *iter)
>  {
> +	if (iter->fbatch) {
> +		folio_batch_release(iter->fbatch);
> +		kfree(iter->fbatch);
> +		iter->fbatch = NULL;
> +	}
> +
>  	iter->status = 0;
>  	memset(&iter->iomap, 0, sizeof(iter->iomap));
>  	memset(&iter->srcmap, 0, sizeof(iter->srcmap));
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 522644d62f30..0b9b460b2873 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -9,6 +9,7 @@
>  #include <linux/types.h>
>  #include <linux/mm_types.h>
>  #include <linux/blkdev.h>
> +#include <linux/pagevec.h>
>  
>  struct address_space;
>  struct fiemap_extent_info;
> @@ -239,6 +240,7 @@ struct iomap_iter {
>  	unsigned flags;
>  	struct iomap iomap;
>  	struct iomap srcmap;
> +	struct folio_batch *fbatch;
>  	void *private;
>  };
>  
> @@ -345,6 +347,8 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
>  bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
>  int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
>  		const struct iomap_ops *ops);
> +loff_t iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t offset,
> +		loff_t length);
>  int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
>  		bool *did_zero, const struct iomap_ops *ops, void *private);
>  int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> -- 
> 2.50.0
> 
> 



* Re: [PATCH v3 7/7] xfs: error tag to force zeroing on debug kernels
  2025-07-14 20:41 ` [PATCH v3 7/7] xfs: error tag to force zeroing on debug kernels Brian Foster
@ 2025-07-15  5:24   ` Darrick J. Wong
  2025-07-15 12:39     ` Brian Foster
  0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2025-07-15  5:24 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy

On Mon, Jul 14, 2025 at 04:41:22PM -0400, Brian Foster wrote:
> iomap_zero_range() has to cover various corner cases that are
> difficult to test on production kernels because it is used in fairly
> limited use cases. For example, it is currently only used by XFS and
> mostly only in partial block zeroing cases.
> 
> While it's possible to test most of these functional cases, we can
> provide more robust test coverage by co-opting fallocate zero range
> to invoke zeroing of the entire range instead of the more efficient
> block punch/allocate sequence. Add an errortag to occasionally
> invoke forced zeroing.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/libxfs/xfs_errortag.h |  4 +++-
>  fs/xfs/xfs_error.c           |  3 +++
>  fs/xfs/xfs_file.c            | 26 ++++++++++++++++++++------
>  3 files changed, 26 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
> index a53c5d40e084..33ca3fc2ca88 100644
> --- a/fs/xfs/libxfs/xfs_errortag.h
> +++ b/fs/xfs/libxfs/xfs_errortag.h
> @@ -65,7 +65,8 @@
>  #define XFS_ERRTAG_WRITE_DELAY_MS			43
>  #define XFS_ERRTAG_EXCHMAPS_FINISH_ONE			44
>  #define XFS_ERRTAG_METAFILE_RESV_CRITICAL		45
> -#define XFS_ERRTAG_MAX					46
> +#define XFS_ERRTAG_FORCE_ZERO_RANGE			46
> +#define XFS_ERRTAG_MAX					47
>  
>  /*
>   * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
> @@ -115,5 +116,6 @@
>  #define XFS_RANDOM_WRITE_DELAY_MS			3000
>  #define XFS_RANDOM_EXCHMAPS_FINISH_ONE			1
>  #define XFS_RANDOM_METAFILE_RESV_CRITICAL		4
> +#define XFS_RANDOM_FORCE_ZERO_RANGE			4
>  
>  #endif /* __XFS_ERRORTAG_H_ */
> diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
> index dbd87e137694..00c0c391c329 100644
> --- a/fs/xfs/xfs_error.c
> +++ b/fs/xfs/xfs_error.c
> @@ -64,6 +64,7 @@ static unsigned int xfs_errortag_random_default[] = {
>  	XFS_RANDOM_WRITE_DELAY_MS,
>  	XFS_RANDOM_EXCHMAPS_FINISH_ONE,
>  	XFS_RANDOM_METAFILE_RESV_CRITICAL,
> +	XFS_RANDOM_FORCE_ZERO_RANGE,
>  };
>  
>  struct xfs_errortag_attr {
> @@ -183,6 +184,7 @@ XFS_ERRORTAG_ATTR_RW(wb_delay_ms,	XFS_ERRTAG_WB_DELAY_MS);
>  XFS_ERRORTAG_ATTR_RW(write_delay_ms,	XFS_ERRTAG_WRITE_DELAY_MS);
>  XFS_ERRORTAG_ATTR_RW(exchmaps_finish_one, XFS_ERRTAG_EXCHMAPS_FINISH_ONE);
>  XFS_ERRORTAG_ATTR_RW(metafile_resv_crit, XFS_ERRTAG_METAFILE_RESV_CRITICAL);
> +XFS_ERRORTAG_ATTR_RW(force_zero_range, XFS_ERRTAG_FORCE_ZERO_RANGE);
>  
>  static struct attribute *xfs_errortag_attrs[] = {
>  	XFS_ERRORTAG_ATTR_LIST(noerror),
> @@ -230,6 +232,7 @@ static struct attribute *xfs_errortag_attrs[] = {
>  	XFS_ERRORTAG_ATTR_LIST(write_delay_ms),
>  	XFS_ERRORTAG_ATTR_LIST(exchmaps_finish_one),
>  	XFS_ERRORTAG_ATTR_LIST(metafile_resv_crit),
> +	XFS_ERRORTAG_ATTR_LIST(force_zero_range),
>  	NULL,
>  };
>  ATTRIBUTE_GROUPS(xfs_errortag);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 0b41b18debf3..c865f9555b77 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -27,6 +27,8 @@
>  #include "xfs_file.h"
>  #include "xfs_aops.h"
>  #include "xfs_zone_alloc.h"
> +#include "xfs_error.h"
> +#include "xfs_errortag.h"
>  
>  #include <linux/dax.h>
>  #include <linux/falloc.h>
> @@ -1269,13 +1271,25 @@ xfs_falloc_zero_range(
>  	if (error)
>  		return error;
>  
> -	error = xfs_free_file_space(XFS_I(inode), offset, len, ac);
> -	if (error)
> -		return error;
> +	/*
> +	 * Zero range implements a full zeroing mechanism but is only used in
> +	 * limited situations. It is more efficient to allocate unwritten
> +	 * extents than to perform zeroing here, so use an errortag to randomly
> +	 * force zeroing on DEBUG kernels for added test coverage.
> +	 */
> +	if (XFS_TEST_ERROR(false, XFS_I(inode)->i_mount,
> +			   XFS_ERRTAG_FORCE_ZERO_RANGE)) {
> +		error = xfs_zero_range(XFS_I(inode), offset, len, ac, NULL);

Isn't this basically the ultra slow fallback version of
FALLOC_FL_WRITE_ZEROES?

--D

> +	} else {
> +		error = xfs_free_file_space(XFS_I(inode), offset, len, ac);
> +		if (error)
> +			return error;
>  
> -	len = round_up(offset + len, blksize) - round_down(offset, blksize);
> -	offset = round_down(offset, blksize);
> -	error = xfs_alloc_file_space(XFS_I(inode), offset, len);
> +		len = round_up(offset + len, blksize) -
> +			round_down(offset, blksize);
> +		offset = round_down(offset, blksize);
> +		error = xfs_alloc_file_space(XFS_I(inode), offset, len);
> +	}
>  	if (error)
>  		return error;
>  	return xfs_falloc_setsize(file, new_size);
> -- 
> 2.50.0
> 
> 



* Re: [PATCH v3 5/7] xfs: fill dirty folios on zero range of unwritten mappings
  2025-07-14 20:41 ` [PATCH v3 5/7] xfs: fill dirty folios on zero range of unwritten mappings Brian Foster
@ 2025-07-15  5:28   ` Darrick J. Wong
  2025-07-15 12:35     ` Brian Foster
  0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2025-07-15  5:28 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy

On Mon, Jul 14, 2025 at 04:41:20PM -0400, Brian Foster wrote:
> Use the iomap folio batch mechanism to select folios to zero on zero
> range of unwritten mappings. Trim the resulting mapping if the batch
> is filled (unlikely for current use cases) to distinguish between a
> range to skip and one that requires another iteration due to a full
> batch.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_iomap.c | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index b5cf5bc6308d..63054f7ead0e 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1691,6 +1691,8 @@ xfs_buffered_write_iomap_begin(
>  	struct iomap		*iomap,
>  	struct iomap		*srcmap)
>  {
> +	struct iomap_iter	*iter = container_of(iomap, struct iomap_iter,
> +						     iomap);
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
>  	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> @@ -1762,6 +1764,7 @@ xfs_buffered_write_iomap_begin(
>  	 */
>  	if (flags & IOMAP_ZERO) {
>  		xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
> +		u64 end;
>  
>  		if (isnullstartblock(imap.br_startblock) &&
>  		    offset_fsb >= eof_fsb)
> @@ -1769,6 +1772,26 @@ xfs_buffered_write_iomap_begin(
>  		if (offset_fsb < eof_fsb && end_fsb > eof_fsb)
>  			end_fsb = eof_fsb;
>  
> +		/*
> +		 * Look up dirty folios for unwritten mappings within EOF.
> +		 * Providing this bypasses the flush iomap uses to trigger
> +		 * extent conversion when unwritten mappings have dirty
> +		 * pagecache in need of zeroing.
> +		 *
> +		 * Trim the mapping to the end pos of the lookup, which in turn
> +		 * was trimmed to the end of the batch if it became full before
> +		 * the end of the mapping.
> +		 */
> +		if (imap.br_state == XFS_EXT_UNWRITTEN &&
> +		    offset_fsb < eof_fsb) {
> +			loff_t len = min(count,
> +					 XFS_FSB_TO_B(mp, imap.br_blockcount));
> +
> +			end = iomap_fill_dirty_folios(iter, offset, len);
> +			end_fsb = min_t(xfs_fileoff_t, end_fsb,
> +					XFS_B_TO_FSB(mp, end));

Hrmm.  XFS_B_TO_FSB and not _FSBT?  Can the rounding up behavior result
in a missed byte range?  I think the answer is no because @end should be
aligned to a folio boundary, and folios can't be smaller than an
fsblock.

If the answer to the second question is indeed "no" then I think this is
ok and
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D


> +		}
> +
>  		xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
>  	}
>  
> -- 
> 2.50.0
> 
> 



* Re: [PATCH v3 6/7] iomap: remove old partial eof zeroing optimization
  2025-07-14 20:41 ` [PATCH v3 6/7] iomap: remove old partial eof zeroing optimization Brian Foster
@ 2025-07-15  5:34   ` Darrick J. Wong
  2025-07-15 12:36     ` Brian Foster
  0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2025-07-15  5:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy

On Mon, Jul 14, 2025 at 04:41:21PM -0400, Brian Foster wrote:
> iomap_zero_range() optimizes the partial eof block zeroing use case
> by force zeroing if the mapping is dirty. This is to avoid frequent
> flushing on file extending workloads, which hurts performance.
> 
> Now that the folio batch mechanism provides a more generic solution
> and is used by the only real zero range user (XFS), this isolated
> optimization is no longer needed. Remove the unnecessary code and
> let callers use the folio batch or fall back to flushing by default.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Heh, I was staring at this last Friday chasing fuse+iomap bugs in
fallocate zerorange and straining to remember what this does.
Is this chunk still needed if the ->iomap_begin implementation doesn't
(or forgets to) grab the folio batch for iomap?

My bug turned out to be a bug in my fuse+iomap design -- with the way
iomap_zero_range does things, you have to flush+unmap, punch the range
and zero the range.  If you punch and realloc the range and *then* try
to zero the range, the new unwritten extents cause iomap to miss dirty
pages that fuse should've unmapped.  Ooops.

--D

> ---
>  fs/iomap/buffered-io.c | 24 ------------------------
>  1 file changed, 24 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 194e3cc0857f..d2bbed692c06 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1484,33 +1484,9 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>  		.private	= private,
>  	};
>  	struct address_space *mapping = inode->i_mapping;
> -	unsigned int blocksize = i_blocksize(inode);
> -	unsigned int off = pos & (blocksize - 1);
> -	loff_t plen = min_t(loff_t, len, blocksize - off);
>  	int ret;
>  	bool range_dirty;
>  
> -	/*
> -	 * Zero range can skip mappings that are zero on disk so long as
> -	 * pagecache is clean. If pagecache was dirty prior to zero range, the
> -	 * mapping converts on writeback completion and so must be zeroed.
> -	 *
> -	 * The simplest way to deal with this across a range is to flush
> -	 * pagecache and process the updated mappings. To avoid excessive
> -	 * flushing on partial eof zeroing, special case it to zero the
> -	 * unaligned start portion if already dirty in pagecache.
> -	 */
> -	if (!iter.fbatch && off &&
> -	    filemap_range_needs_writeback(mapping, pos, pos + plen - 1)) {
> -		iter.len = plen;
> -		while ((ret = iomap_iter(&iter, ops)) > 0)
> -			iter.status = iomap_zero_iter(&iter, did_zero);
> -
> -		iter.len = len - (iter.pos - pos);
> -		if (ret || !iter.len)
> -			return ret;
> -	}
> -
>  	/*
>  	 * To avoid an unconditional flush, check pagecache state and only flush
>  	 * if dirty and the fs returns a mapping that might convert on
> -- 
> 2.50.0
> 
> 



* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing
  2025-07-15  5:22   ` Darrick J. Wong
@ 2025-07-15 12:35     ` Brian Foster
  2025-07-18 11:30     ` Zhang Yi
  1 sibling, 0 replies; 37+ messages in thread
From: Brian Foster @ 2025-07-15 12:35 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy

On Mon, Jul 14, 2025 at 10:22:59PM -0700, Darrick J. Wong wrote:
> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote:
> > The only way zero range can currently process unwritten mappings
> > with dirty pagecache is to check whether the range is dirty before
> > mapping lookup and then flush when at least one underlying mapping
> > is unwritten. This ordering is required to prevent iomap lookup from
> > racing with folio writeback and reclaim.
> > 
> > Since zero range can skip ranges of unwritten mappings that are
> > clean in cache, this operation can be improved by allowing the
> > filesystem to provide a set of dirty folios that require zeroing. In
> > turn, rather than flush or iterate file offsets, zero range can
> > iterate on folios in the batch and advance over clean or uncached
> > ranges in between.
> > 
> > Add a folio_batch in struct iomap and provide a helper for fs' to
> 
> /me confused by the single quote; is this supposed to read:
> 
> "...for the fs to populate..."?
> 

Eh, I intended it to read "for filesystems to populate." I'll change it
to that locally.

Brian

> Either way the code changes look like a reasonable thing to do for the
> pagecache (try to grab a bunch of dirty folios while XFS holds the
> mapping lock) so
> 
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> 
> --D
> 
> 
> > populate the batch at lookup time. Update the folio lookup path to
> > return the next folio in the batch, if provided, and advance the
> > iter if the folio starts beyond the current offset.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++---
> >  fs/iomap/iter.c        |  6 +++
> >  include/linux/iomap.h  |  4 ++
> >  3 files changed, 94 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 38da2fa6e6b0..194e3cc0857f 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -750,6 +750,28 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter, size_t len)
> >  	if (!mapping_large_folio_support(iter->inode->i_mapping))
> >  		len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
> >  
> > +	if (iter->fbatch) {
> > +		struct folio *folio = folio_batch_next(iter->fbatch);
> > +
> > +		if (!folio)
> > +			return NULL;
> > +
> > +		/*
> > +		 * The folio mapping generally shouldn't have changed based on
> > +		 * fs locks, but be consistent with filemap lookup and retry
> > +		 * the iter if it does.
> > +		 */
> > +		folio_lock(folio);
> > +		if (unlikely(folio->mapping != iter->inode->i_mapping)) {
> > +			iter->iomap.flags |= IOMAP_F_STALE;
> > +			folio_unlock(folio);
> > +			return NULL;
> > +		}
> > +
> > +		folio_get(folio);
> > +		return folio;
> > +	}
> > +
> >  	if (folio_ops && folio_ops->get_folio)
> >  		return folio_ops->get_folio(iter, pos, len);
> >  	else
> > @@ -811,6 +833,8 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
> >  	int status = 0;
> >  
> >  	len = min_not_zero(len, *plen);
> > +	*foliop = NULL;
> > +	*plen = 0;
> >  
> >  	if (fatal_signal_pending(current))
> >  		return -EINTR;
> > @@ -819,6 +843,15 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
> >  	if (IS_ERR(folio))
> >  		return PTR_ERR(folio);
> >  
> > +	/*
> > +	 * No folio means we're done with a batch. We still have range to
> > +	 * process so return and let the caller iterate and refill the batch.
> > +	 */
> > +	if (!folio) {
> > +		WARN_ON_ONCE(!iter->fbatch);
> > +		return 0;
> > +	}
> > +
> >  	/*
> >  	 * Now we have a locked folio, before we do anything with it we need to
> >  	 * check that the iomap we have cached is not stale. The inode extent
> > @@ -839,6 +872,21 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
> >  		}
> >  	}
> >  
> > +	/*
> > +	 * The folios in a batch may not be contiguous. If we've skipped
> > +	 * forward, advance the iter to the pos of the current folio. If the
> > +	 * folio starts beyond the end of the mapping, it may have been trimmed
> > +	 * since the lookup for whatever reason. Return a NULL folio to
> > +	 * terminate the op.
> > +	 */
> > +	if (folio_pos(folio) > iter->pos) {
> > +		len = min_t(u64, folio_pos(folio) - iter->pos,
> > +				 iomap_length(iter));
> > +		status = iomap_iter_advance(iter, &len);
> > +		if (status || !len)
> > +			goto out_unlock;
> > +	}
> > +
> >  	pos = iomap_trim_folio_range(iter, folio, poffset, &len);
> >  
> >  	if (srcmap->type == IOMAP_INLINE)
> > @@ -1377,6 +1425,12 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> >  		if (iter->iomap.flags & IOMAP_F_STALE)
> >  			break;
> >  
> > +		/* a NULL folio means we're done with a folio batch */
> > +		if (!folio) {
> > +			status = iomap_iter_advance_full(iter);
> > +			break;
> > +		}
> > +
> >  		/* warn about zeroing folios beyond eof that won't write back */
> >  		WARN_ON_ONCE(folio_pos(folio) > iter->inode->i_size);
> >  
> > @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> >  	return status;
> >  }
> >  
> > +loff_t
> > +iomap_fill_dirty_folios(
> > +	struct iomap_iter	*iter,
> > +	loff_t			offset,
> > +	loff_t			length)
> > +{
> > +	struct address_space	*mapping = iter->inode->i_mapping;
> > +	pgoff_t			start = offset >> PAGE_SHIFT;
> > +	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;
> > +
> > +	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
> > +	if (!iter->fbatch)
> > +		return offset + length;
> > +	folio_batch_init(iter->fbatch);
> > +
> > +	filemap_get_folios_dirty(mapping, &start, end, iter->fbatch);
> > +	return (start << PAGE_SHIFT);
> > +}
> > +EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios);
> > +
> >  int
> >  iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> >  		const struct iomap_ops *ops, void *private)
> > @@ -1426,7 +1500,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> >  	 * flushing on partial eof zeroing, special case it to zero the
> >  	 * unaligned start portion if already dirty in pagecache.
> >  	 */
> > -	if (off &&
> > +	if (!iter.fbatch && off &&
> >  	    filemap_range_needs_writeback(mapping, pos, pos + plen - 1)) {
> >  		iter.len = plen;
> >  		while ((ret = iomap_iter(&iter, ops)) > 0)
> > @@ -1442,13 +1516,18 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> >  	 * if dirty and the fs returns a mapping that might convert on
> >  	 * writeback.
> >  	 */
> > -	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
> > -					iter.pos, iter.pos + iter.len - 1);
> > +	range_dirty = filemap_range_needs_writeback(mapping, iter.pos,
> > +					iter.pos + iter.len - 1);
> >  	while ((ret = iomap_iter(&iter, ops)) > 0) {
> >  		const struct iomap *srcmap = iomap_iter_srcmap(&iter);
> >  
> > -		if (srcmap->type == IOMAP_HOLE ||
> > -		    srcmap->type == IOMAP_UNWRITTEN) {
> > +		if (WARN_ON_ONCE(iter.fbatch &&
> > +				 srcmap->type != IOMAP_UNWRITTEN))
> > +			return -EIO;
> > +
> > +		if (!iter.fbatch &&
> > +		    (srcmap->type == IOMAP_HOLE ||
> > +		     srcmap->type == IOMAP_UNWRITTEN)) {
> >  			s64 status;
> >  
> >  			if (range_dirty) {
> > diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
> > index 6ffc6a7b9ba5..89bd5951a6fd 100644
> > --- a/fs/iomap/iter.c
> > +++ b/fs/iomap/iter.c
> > @@ -9,6 +9,12 @@
> >  
> >  static inline void iomap_iter_reset_iomap(struct iomap_iter *iter)
> >  {
> > +	if (iter->fbatch) {
> > +		folio_batch_release(iter->fbatch);
> > +		kfree(iter->fbatch);
> > +		iter->fbatch = NULL;
> > +	}
> > +
> >  	iter->status = 0;
> >  	memset(&iter->iomap, 0, sizeof(iter->iomap));
> >  	memset(&iter->srcmap, 0, sizeof(iter->srcmap));
> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index 522644d62f30..0b9b460b2873 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -9,6 +9,7 @@
> >  #include <linux/types.h>
> >  #include <linux/mm_types.h>
> >  #include <linux/blkdev.h>
> > +#include <linux/pagevec.h>
> >  
> >  struct address_space;
> >  struct fiemap_extent_info;
> > @@ -239,6 +240,7 @@ struct iomap_iter {
> >  	unsigned flags;
> >  	struct iomap iomap;
> >  	struct iomap srcmap;
> > +	struct folio_batch *fbatch;
> >  	void *private;
> >  };
> >  
> > @@ -345,6 +347,8 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
> >  bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
> >  int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> >  		const struct iomap_ops *ops);
> > +loff_t iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t offset,
> > +		loff_t length);
> >  int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
> >  		bool *did_zero, const struct iomap_ops *ops, void *private);
> >  int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> > -- 
> > 2.50.0
> > 
> > 
> 




* Re: [PATCH v3 5/7] xfs: fill dirty folios on zero range of unwritten mappings
  2025-07-15  5:28   ` Darrick J. Wong
@ 2025-07-15 12:35     ` Brian Foster
  2025-07-15 14:19       ` Darrick J. Wong
  0 siblings, 1 reply; 37+ messages in thread
From: Brian Foster @ 2025-07-15 12:35 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy

On Mon, Jul 14, 2025 at 10:28:11PM -0700, Darrick J. Wong wrote:
> On Mon, Jul 14, 2025 at 04:41:20PM -0400, Brian Foster wrote:
> > Use the iomap folio batch mechanism to select folios to zero on zero
> > range of unwritten mappings. Trim the resulting mapping if the batch
> > is filled (unlikely for current use cases) to distinguish between a
> > range to skip and one that requires another iteration due to a full
> > batch.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  fs/xfs/xfs_iomap.c | 23 +++++++++++++++++++++++
> >  1 file changed, 23 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index b5cf5bc6308d..63054f7ead0e 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -1691,6 +1691,8 @@ xfs_buffered_write_iomap_begin(
> >  	struct iomap		*iomap,
> >  	struct iomap		*srcmap)
> >  {
> > +	struct iomap_iter	*iter = container_of(iomap, struct iomap_iter,
> > +						     iomap);
> >  	struct xfs_inode	*ip = XFS_I(inode);
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> > @@ -1762,6 +1764,7 @@ xfs_buffered_write_iomap_begin(
> >  	 */
> >  	if (flags & IOMAP_ZERO) {
> >  		xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
> > +		u64 end;
> >  
> >  		if (isnullstartblock(imap.br_startblock) &&
> >  		    offset_fsb >= eof_fsb)
> > @@ -1769,6 +1772,26 @@ xfs_buffered_write_iomap_begin(
> >  		if (offset_fsb < eof_fsb && end_fsb > eof_fsb)
> >  			end_fsb = eof_fsb;
> >  
> > +		/*
> > +		 * Look up dirty folios for unwritten mappings within EOF.
> > +		 * Providing this bypasses the flush iomap uses to trigger
> > +		 * extent conversion when unwritten mappings have dirty
> > +		 * pagecache in need of zeroing.
> > +		 *
> > +		 * Trim the mapping to the end pos of the lookup, which in turn
> > +		 * was trimmed to the end of the batch if it became full before
> > +		 * the end of the mapping.
> > +		 */
> > +		if (imap.br_state == XFS_EXT_UNWRITTEN &&
> > +		    offset_fsb < eof_fsb) {
> > +			loff_t len = min(count,
> > +					 XFS_FSB_TO_B(mp, imap.br_blockcount));
> > +
> > +			end = iomap_fill_dirty_folios(iter, offset, len);
> > +			end_fsb = min_t(xfs_fileoff_t, end_fsb,
> > +					XFS_B_TO_FSB(mp, end));
> 
> Hrmm.  XFS_B_TO_FSB and not _FSBT?  Can the rounding up behavior result
> in a missed byte range?  I think the answer is no because @end should be
> aligned to a folio boundary, and folios can't be smaller than an
> fsblock.
> 

Hmm.. not that I'm aware of..? Please elaborate if there's a case you're
suspicious of because I could have certainly got my wires crossed.

My thinking is that end_fsb reflects the first fsb beyond the target
range. I.e., it's calculated and used as such in xfs_iomap_end_fsb() and
the various xfs_trim_extent() calls throughout the rest of the function.

Brian

> If the answer to the second question is indeed "no" then I think this is
> ok and
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> 
> --D
> 
> 
> > +		}
> > +
> >  		xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
> >  	}
> >  
> > -- 
> > 2.50.0
> > 
> > 
> 




* Re: [PATCH v3 6/7] iomap: remove old partial eof zeroing optimization
  2025-07-15  5:34   ` Darrick J. Wong
@ 2025-07-15 12:36     ` Brian Foster
  2025-07-15 14:37       ` Darrick J. Wong
  0 siblings, 1 reply; 37+ messages in thread
From: Brian Foster @ 2025-07-15 12:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy

On Mon, Jul 14, 2025 at 10:34:17PM -0700, Darrick J. Wong wrote:
> On Mon, Jul 14, 2025 at 04:41:21PM -0400, Brian Foster wrote:
> > iomap_zero_range() optimizes the partial eof block zeroing use case
> > by force zeroing if the mapping is dirty. This is to avoid frequent
> > flushing on file extending workloads, which hurts performance.
> > 
> > Now that the folio batch mechanism provides a more generic solution
> > and is used by the only real zero range user (XFS), this isolated
> > optimization is no longer needed. Remove the unnecessary code and
> > let callers use the folio batch or fall back to flushing by default.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> Heh, I was staring at this last Friday chasing fuse+iomap bugs in
> fallocate zerorange and straining to remember what this does.
> Is this chunk still needed if the ->iomap_begin implementation doesn't
> (or forgets to) grab the folio batch for iomap?
> 

No, the hunk removed by this patch is just an optimization. The fallback
code here flushes the range if it's dirty and retries the lookup (i.e.
picking up unwritten conversions that were pending via dirty pagecache).
That flush logic caused a performance regression in a particular
workload, so this was introduced to mitigate that regression by just
doing the zeroing for the first block or so if the folio is dirty. [1]

The reason for removing it is more just for maintainability. XFS is
really the only user here and it is changing over to the more generic
batch mechanism, which effectively provides the same optimization, so
this basically becomes dead/duplicate code. If an fs doesn't use the
batch mechanism it will just fall back to the flush and retry approach,
which can be slower but is functionally correct.

> My bug turned out to be a bug in my fuse+iomap design -- with the way
> iomap_zero_range does things, you have to flush+unmap, punch the range
> and zero the range.  If you punch and realloc the range and *then* try
> to zero the range, the new unwritten extents cause iomap to miss dirty
> pages that fuse should've unmapped.  Ooops.
> 

I don't quite follow. How do you mean it misses dirty pages?

Brian

[1] Details described in the commit log of fde4c4c3ec1c ("iomap: elide
flush from partial eof zero range").

> --D
> 
> > ---
> >  fs/iomap/buffered-io.c | 24 ------------------------
> >  1 file changed, 24 deletions(-)
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 194e3cc0857f..d2bbed692c06 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -1484,33 +1484,9 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> >  		.private	= private,
> >  	};
> >  	struct address_space *mapping = inode->i_mapping;
> > -	unsigned int blocksize = i_blocksize(inode);
> > -	unsigned int off = pos & (blocksize - 1);
> > -	loff_t plen = min_t(loff_t, len, blocksize - off);
> >  	int ret;
> >  	bool range_dirty;
> >  
> > -	/*
> > -	 * Zero range can skip mappings that are zero on disk so long as
> > -	 * pagecache is clean. If pagecache was dirty prior to zero range, the
> > -	 * mapping converts on writeback completion and so must be zeroed.
> > -	 *
> > -	 * The simplest way to deal with this across a range is to flush
> > -	 * pagecache and process the updated mappings. To avoid excessive
> > -	 * flushing on partial eof zeroing, special case it to zero the
> > -	 * unaligned start portion if already dirty in pagecache.
> > -	 */
> > -	if (!iter.fbatch && off &&
> > -	    filemap_range_needs_writeback(mapping, pos, pos + plen - 1)) {
> > -		iter.len = plen;
> > -		while ((ret = iomap_iter(&iter, ops)) > 0)
> > -			iter.status = iomap_zero_iter(&iter, did_zero);
> > -
> > -		iter.len = len - (iter.pos - pos);
> > -		if (ret || !iter.len)
> > -			return ret;
> > -	}
> > -
> >  	/*
> >  	 * To avoid an unconditional flush, check pagecache state and only flush
> >  	 * if dirty and the fs returns a mapping that might convert on
> > -- 
> > 2.50.0
> > 
> > 
> 




* Re: [PATCH v3 7/7] xfs: error tag to force zeroing on debug kernels
  2025-07-15  5:24   ` Darrick J. Wong
@ 2025-07-15 12:39     ` Brian Foster
  2025-07-15 14:30       ` Darrick J. Wong
  0 siblings, 1 reply; 37+ messages in thread
From: Brian Foster @ 2025-07-15 12:39 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy

On Mon, Jul 14, 2025 at 10:24:44PM -0700, Darrick J. Wong wrote:
> On Mon, Jul 14, 2025 at 04:41:22PM -0400, Brian Foster wrote:
> > iomap_zero_range() has to cover various corner cases that are
> > difficult to test on production kernels because it is used in fairly
> > limited use cases. For example, it is currently only used by XFS and
> > mostly only in partial block zeroing cases.
> > 
> > While it's possible to test most of these functional cases, we can
> > provide more robust test coverage by co-opting fallocate zero range
> > to invoke zeroing of the entire range instead of the more efficient
> > block punch/allocate sequence. Add an errortag to occasionally
> > invoke forced zeroing.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  fs/xfs/libxfs/xfs_errortag.h |  4 +++-
> >  fs/xfs/xfs_error.c           |  3 +++
> >  fs/xfs/xfs_file.c            | 26 ++++++++++++++++++++------
> >  3 files changed, 26 insertions(+), 7 deletions(-)
> > 
...
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 0b41b18debf3..c865f9555b77 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -27,6 +27,8 @@
> >  #include "xfs_file.h"
> >  #include "xfs_aops.h"
> >  #include "xfs_zone_alloc.h"
> > +#include "xfs_error.h"
> > +#include "xfs_errortag.h"
> >  
> >  #include <linux/dax.h>
> >  #include <linux/falloc.h>
> > @@ -1269,13 +1271,25 @@ xfs_falloc_zero_range(
> >  	if (error)
> >  		return error;
> >  
> > -	error = xfs_free_file_space(XFS_I(inode), offset, len, ac);
> > -	if (error)
> > -		return error;
> > +	/*
> > +	 * Zero range implements a full zeroing mechanism but is only used in
> > +	 * limited situations. It is more efficient to allocate unwritten
> > +	 * extents than to perform zeroing here, so use an errortag to randomly
> > +	 * force zeroing on DEBUG kernels for added test coverage.
> > +	 */
> > +	if (XFS_TEST_ERROR(false, XFS_I(inode)->i_mount,
> > +			   XFS_ERRTAG_FORCE_ZERO_RANGE)) {
> > +		error = xfs_zero_range(XFS_I(inode), offset, len, ac, NULL);
> 
> Isn't this basically the ultra slow fallback version of
> FALLOC_FL_WRITE_ZEROES?
> 

~/linux$ git grep FALLOC_FL_WRITE_ZEROES
~/linux$ 

IIRC write zeroes is intended to expose fast hardware (physical) zeroing
(i.e. zeroed written extents)..? If so, I suppose you could consider
this a fallback of sorts. I'm not sure what write zeroes is expected to
do in the unwritten extent case, whereas iomap zero range is happy to
skip those mappings unless they're already dirty in pagecache.

Regardless, the purpose of this patch is not to add support for physical
zeroing, but rather to increase test coverage for the additional code on
debug kernels because the production use case tends to be more limited.
This could easily be moved/applied to write zeroes if it makes sense in
the future and test infra grows support for it.

Brian

> --D
> 
> > +	} else {
> > +		error = xfs_free_file_space(XFS_I(inode), offset, len, ac);
> > +		if (error)
> > +			return error;
> >  
> > -	len = round_up(offset + len, blksize) - round_down(offset, blksize);
> > -	offset = round_down(offset, blksize);
> > -	error = xfs_alloc_file_space(XFS_I(inode), offset, len);
> > +		len = round_up(offset + len, blksize) -
> > +			round_down(offset, blksize);
> > +		offset = round_down(offset, blksize);
> > +		error = xfs_alloc_file_space(XFS_I(inode), offset, len);
> > +	}
> >  	if (error)
> >  		return error;
> >  	return xfs_falloc_setsize(file, new_size);
> > -- 
> > 2.50.0
> > 
> > 
> 




* Re: [PATCH v3 5/7] xfs: fill dirty folios on zero range of unwritten mappings
  2025-07-15 12:35     ` Brian Foster
@ 2025-07-15 14:19       ` Darrick J. Wong
  0 siblings, 0 replies; 37+ messages in thread
From: Darrick J. Wong @ 2025-07-15 14:19 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy

On Tue, Jul 15, 2025 at 08:35:51AM -0400, Brian Foster wrote:
> On Mon, Jul 14, 2025 at 10:28:11PM -0700, Darrick J. Wong wrote:
> > On Mon, Jul 14, 2025 at 04:41:20PM -0400, Brian Foster wrote:
> > > Use the iomap folio batch mechanism to select folios to zero on zero
> > > range of unwritten mappings. Trim the resulting mapping if the batch
> > > is filled (unlikely for current use cases) to distinguish between a
> > > range to skip and one that requires another iteration due to a full
> > > batch.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > ---
> > >  fs/xfs/xfs_iomap.c | 23 +++++++++++++++++++++++
> > >  1 file changed, 23 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index b5cf5bc6308d..63054f7ead0e 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -1691,6 +1691,8 @@ xfs_buffered_write_iomap_begin(
> > >  	struct iomap		*iomap,
> > >  	struct iomap		*srcmap)
> > >  {
> > > +	struct iomap_iter	*iter = container_of(iomap, struct iomap_iter,
> > > +						     iomap);
> > >  	struct xfs_inode	*ip = XFS_I(inode);
> > >  	struct xfs_mount	*mp = ip->i_mount;
> > >  	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> > > @@ -1762,6 +1764,7 @@ xfs_buffered_write_iomap_begin(
> > >  	 */
> > >  	if (flags & IOMAP_ZERO) {
> > >  		xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
> > > +		u64 end;
> > >  
> > >  		if (isnullstartblock(imap.br_startblock) &&
> > >  		    offset_fsb >= eof_fsb)
> > > @@ -1769,6 +1772,26 @@ xfs_buffered_write_iomap_begin(
> > >  		if (offset_fsb < eof_fsb && end_fsb > eof_fsb)
> > >  			end_fsb = eof_fsb;
> > >  
> > > +		/*
> > > +		 * Look up dirty folios for unwritten mappings within EOF.
> > > +		 * Providing this bypasses the flush iomap uses to trigger
> > > +		 * extent conversion when unwritten mappings have dirty
> > > +		 * pagecache in need of zeroing.
> > > +		 *
> > > +		 * Trim the mapping to the end pos of the lookup, which in turn
> > > +		 * was trimmed to the end of the batch if it became full before
> > > +		 * the end of the mapping.
> > > +		 */
> > > +		if (imap.br_state == XFS_EXT_UNWRITTEN &&
> > > +		    offset_fsb < eof_fsb) {
> > > +			loff_t len = min(count,
> > > +					 XFS_FSB_TO_B(mp, imap.br_blockcount));
> > > +
> > > +			end = iomap_fill_dirty_folios(iter, offset, len);
> > > +			end_fsb = min_t(xfs_fileoff_t, end_fsb,
> > > +					XFS_B_TO_FSB(mp, end));
> > 
> > Hrmm.  XFS_B_TO_FSB and not _FSBT?  Can the rounding up behavior result
> > in a missed byte range?  I think the answer is no because @end should be
> > aligned to a folio boundary, and folios can't be smaller than an
> > fsblock.
> > 
> 
> Hmm.. not that I'm aware of..? Please elaborate if there's a case you're
> suspicious of because I could have certainly got my wires crossed.

I don't have a specific case in mind.  I saw the conversion function and
thought "well, what IF the return value from iomap_fill_dirty_folios
isn't aligned to a fsblock?" and then went around trying to prove that
isn't possible. :)

> My thinking is that end_fsb reflects the first fsb beyond the target
> range. I.e., it's calculated and used as such in xfs_iomap_end_fsb() and
> the various xfs_trim_extent() calls throughout the rest of the function.

<nod> So I think we're fine here.
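
To make the rounding concrete, here's a quick userspace sketch with made-up
sizes (1k fsblocks, 4k folios). The helpers only mimic the round-up vs.
truncate semantics being discussed; they are not the real XFS macros:

#include <stdio.h>
#include <stdint.h>

#define FSB_SHIFT	10	/* pretend 1k fsblocks */
#define FOLIO_SHIFT	12	/* pretend 4k folios */

/* byte offset to fsblock, rounding up (XFS_B_TO_FSB-style) */
static uint64_t b_to_fsb(uint64_t b)
{
	return (b + (1ULL << FSB_SHIFT) - 1) >> FSB_SHIFT;
}

/* byte offset to fsblock, truncating (XFS_B_TO_FSBT-style) */
static uint64_t b_to_fsbt(uint64_t b)
{
	return b >> FSB_SHIFT;
}

int main(void)
{
	/* @end from the folio lookup lands on a folio boundary... */
	uint64_t end = 3ULL << FOLIO_SHIFT;	/* 12288 */

	/* ...and a folio is never smaller than an fsblock, so both agree */
	printf("round up %llu, truncate %llu\n",
	       (unsigned long long)b_to_fsb(end),
	       (unsigned long long)b_to_fsbt(end));	/* prints 12 and 12 */
	return 0;
}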

--D

> Brian
> 
> > If the answer to the second question is indeed "no" then I think this is
> > ok and
> > Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> > 
> > --D
> > 
> > 
> > > +		}
> > > +
> > >  		xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
> > >  	}
> > >  
> > > -- 
> > > 2.50.0
> > > 
> > > 
> > 
> 
> 



* Re: [PATCH v3 7/7] xfs: error tag to force zeroing on debug kernels
  2025-07-15 12:39     ` Brian Foster
@ 2025-07-15 14:30       ` Darrick J. Wong
  2025-07-15 16:20         ` Brian Foster
  0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2025-07-15 14:30 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy

On Tue, Jul 15, 2025 at 08:39:03AM -0400, Brian Foster wrote:
> On Mon, Jul 14, 2025 at 10:24:44PM -0700, Darrick J. Wong wrote:
> > On Mon, Jul 14, 2025 at 04:41:22PM -0400, Brian Foster wrote:
> > > iomap_zero_range() has to cover various corner cases that are
> > > difficult to test on production kernels because it is used in fairly
> > > limited use cases. For example, it is currently only used by XFS and
> > > mostly only in partial block zeroing cases.
> > > 
> > > While it's possible to test most of these functional cases, we can
> > > provide more robust test coverage by co-opting fallocate zero range
> > > to invoke zeroing of the entire range instead of the more efficient
> > > block punch/allocate sequence. Add an errortag to occasionally
> > > invoke forced zeroing.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > ---
> > >  fs/xfs/libxfs/xfs_errortag.h |  4 +++-
> > >  fs/xfs/xfs_error.c           |  3 +++
> > >  fs/xfs/xfs_file.c            | 26 ++++++++++++++++++++------
> > >  3 files changed, 26 insertions(+), 7 deletions(-)
> > > 
> ...
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 0b41b18debf3..c865f9555b77 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -27,6 +27,8 @@
> > >  #include "xfs_file.h"
> > >  #include "xfs_aops.h"
> > >  #include "xfs_zone_alloc.h"
> > > +#include "xfs_error.h"
> > > +#include "xfs_errortag.h"
> > >  
> > >  #include <linux/dax.h>
> > >  #include <linux/falloc.h>
> > > @@ -1269,13 +1271,25 @@ xfs_falloc_zero_range(
> > >  	if (error)
> > >  		return error;
> > >  
> > > -	error = xfs_free_file_space(XFS_I(inode), offset, len, ac);
> > > -	if (error)
> > > -		return error;
> > > +	/*
> > > +	 * Zero range implements a full zeroing mechanism but is only used in
> > > +	 * limited situations. It is more efficient to allocate unwritten
> > > +	 * extents than to perform zeroing here, so use an errortag to randomly
> > > +	 * force zeroing on DEBUG kernels for added test coverage.
> > > +	 */
> > > +	if (XFS_TEST_ERROR(false, XFS_I(inode)->i_mount,
> > > +			   XFS_ERRTAG_FORCE_ZERO_RANGE)) {
> > > +		error = xfs_zero_range(XFS_I(inode), offset, len, ac, NULL);
> > 
> > Isn't this basically the ultra slow fallback version of
> > FALLOC_FL_WRITE_ZEROES?
> > 
> 
> ~/linux$ git grep FALLOC_FL_WRITE_ZEROES
> ~/linux$ 
> 
> IIRC write zeroes is intended to expose fast hardware (physical) zeroing
> (i.e. zeroed written extents)..? If so, I suppose you could consider
> this a fallback of sorts. I'm not sure what write zeroes is expected to
> do in the unwritten extent case, whereas iomap zero range is happy to
> skip those mappings unless they're already dirty in pagecache.

Sorry, forgot that they weren't wiring anything up in xfs so it never
showed up here:
https://lore.kernel.org/linux-fsdevel/20250619111806.3546162-1-yi.zhang@huaweicloud.com/

Basically they want to avoid the unwritten extent conversion overhead by
providing a way to preallocate written zeroed extents and sending magic
commands to hardware that unmaps LBAs in such a way that rereads return
zero.

> Regardless, the purpose of this patch is not to add support for physical
> zeroing, but rather to increase test coverage for the additional code on
> debug kernels because the production use case tends to be more limited.
> This could easily be moved/applied to write zeroes if it makes sense in
> the future and test infra grows support for it.

<nod> On second look, I don't think the new fallocate flag allows for
letting the kernel do pagecache zeroing + flush.  Admittedly that would
be beside the point (and userspaces already do that anyway).

Anyway enough mumbling from me,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> Brian
> 
> > --D
> > 
> > > +	} else {
> > > +		error = xfs_free_file_space(XFS_I(inode), offset, len, ac);
> > > +		if (error)
> > > +			return error;
> > >  
> > > -	len = round_up(offset + len, blksize) - round_down(offset, blksize);
> > > -	offset = round_down(offset, blksize);
> > > -	error = xfs_alloc_file_space(XFS_I(inode), offset, len);
> > > +		len = round_up(offset + len, blksize) -
> > > +			round_down(offset, blksize);
> > > +		offset = round_down(offset, blksize);
> > > +		error = xfs_alloc_file_space(XFS_I(inode), offset, len);
> > > +	}
> > >  	if (error)
> > >  		return error;
> > >  	return xfs_falloc_setsize(file, new_size);
> > > -- 
> > > 2.50.0
> > > 
> > > 
> > 
> 
> 



* Re: [PATCH v3 6/7] iomap: remove old partial eof zeroing optimization
  2025-07-15 12:36     ` Brian Foster
@ 2025-07-15 14:37       ` Darrick J. Wong
  2025-07-15 16:20         ` Brian Foster
  0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2025-07-15 14:37 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy

On Tue, Jul 15, 2025 at 08:36:54AM -0400, Brian Foster wrote:
> On Mon, Jul 14, 2025 at 10:34:17PM -0700, Darrick J. Wong wrote:
> > On Mon, Jul 14, 2025 at 04:41:21PM -0400, Brian Foster wrote:
> > > iomap_zero_range() optimizes the partial eof block zeroing use case
> > > by force zeroing if the mapping is dirty. This is to avoid frequent
> > > flushing on file extending workloads, which hurts performance.
> > > 
> > > Now that the folio batch mechanism provides a more generic solution
> > > and is used by the only real zero range user (XFS), this isolated
> > > optimization is no longer needed. Remove the unnecessary code and
> > > let callers use the folio batch or fall back to flushing by default.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > 
> > Heh, I was staring at this last Friday chasing fuse+iomap bugs in
> > fallocate zerorange and straining to remember what this does.
> > Is this chunk still needed if the ->iomap_begin implementation doesn't
> > (or forgets to) grab the folio batch for iomap?
> > 
> 
> No, the hunk removed by this patch is just an optimization. The fallback
> code here flushes the range if it's dirty and retries the lookup (i.e.
> picking up unwritten conversions that were pending via dirty pagecache).
> That flush logic caused a performance regression in a particular
> workload, so this was introduced to mitigate that regression by just
> doing the zeroing for the first block or so if the folio is dirty. [1]
> 
> The reason for removing it is more just for maintainability. XFS is
> really the only user here and it is changing over to the more generic
> batch mechanism, which effectively provides the same optimization, so
> this basically becomes dead/duplicate code. If an fs doesn't use the
> batch mechanism it will just fall back to the flush and retry approach,
> which can be slower but is functionally correct.

Oh ok thanks for the reminder.
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

> > My bug turned out to be a bug in my fuse+iomap design -- with the way
> > iomap_zero_range does things, you have to flush+unmap, punch the range
> > and zero the range.  If you punch and realloc the range and *then* try
> > to zero the range, the new unwritten extents cause iomap to miss dirty
> > pages that fuse should've unmapped.  Ooops.
> > 
> 
> I don't quite follow. How do you mean it misses dirty pages?

Oops, I misspoke, the folios were clean.  Let's say the pagecache is
sparsely populated with some folios for written space:

-------fffff-------fffffff
wwwwwwwwwwwwwwwwwwwwwwwwww

Now you tell it to go zero range the middle.  fuse's fallocate code
issues the upcall to userspace, which changes some mappings:

-------fffff-------fffffff
wwwwwuuuuuuuuuuuwwwwwwwwww

Only after the upcall returns does the kernel try to do the pagecache
zeroing.  Unfortunately, the mapping changed to unwritten so
iomap_zero_range doesn't see the "fffff" and leaves its contents intact.

(Note: Non-iomap fuse defers everything to the fuse server so this isn't
a problem if the fuse server does all the zeroing itself.)

--D

> Brian
> 
> [1] Details described in the commit log of fde4c4c3ec1c ("iomap: elide
> flush from partial eof zero range").
> 
> > --D
> > 
> > > ---
> > >  fs/iomap/buffered-io.c | 24 ------------------------
> > >  1 file changed, 24 deletions(-)
> > > 
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index 194e3cc0857f..d2bbed692c06 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -1484,33 +1484,9 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > >  		.private	= private,
> > >  	};
> > >  	struct address_space *mapping = inode->i_mapping;
> > > -	unsigned int blocksize = i_blocksize(inode);
> > > -	unsigned int off = pos & (blocksize - 1);
> > > -	loff_t plen = min_t(loff_t, len, blocksize - off);
> > >  	int ret;
> > >  	bool range_dirty;
> > >  
> > > -	/*
> > > -	 * Zero range can skip mappings that are zero on disk so long as
> > > -	 * pagecache is clean. If pagecache was dirty prior to zero range, the
> > > -	 * mapping converts on writeback completion and so must be zeroed.
> > > -	 *
> > > -	 * The simplest way to deal with this across a range is to flush
> > > -	 * pagecache and process the updated mappings. To avoid excessive
> > > -	 * flushing on partial eof zeroing, special case it to zero the
> > > -	 * unaligned start portion if already dirty in pagecache.
> > > -	 */
> > > -	if (!iter.fbatch && off &&
> > > -	    filemap_range_needs_writeback(mapping, pos, pos + plen - 1)) {
> > > -		iter.len = plen;
> > > -		while ((ret = iomap_iter(&iter, ops)) > 0)
> > > -			iter.status = iomap_zero_iter(&iter, did_zero);
> > > -
> > > -		iter.len = len - (iter.pos - pos);
> > > -		if (ret || !iter.len)
> > > -			return ret;
> > > -	}
> > > -
> > >  	/*
> > >  	 * To avoid an unconditional flush, check pagecache state and only flush
> > >  	 * if dirty and the fs returns a mapping that might convert on
> > > -- 
> > > 2.50.0
> > > 
> > > 
> > 
> 
> 



* Re: [PATCH v3 6/7] iomap: remove old partial eof zeroing optimization
  2025-07-15 14:37       ` Darrick J. Wong
@ 2025-07-15 16:20         ` Brian Foster
  2025-07-15 16:30           ` Darrick J. Wong
  0 siblings, 1 reply; 37+ messages in thread
From: Brian Foster @ 2025-07-15 16:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy

On Tue, Jul 15, 2025 at 07:37:33AM -0700, Darrick J. Wong wrote:
> On Tue, Jul 15, 2025 at 08:36:54AM -0400, Brian Foster wrote:
> > On Mon, Jul 14, 2025 at 10:34:17PM -0700, Darrick J. Wong wrote:
> > > On Mon, Jul 14, 2025 at 04:41:21PM -0400, Brian Foster wrote:
> > > > iomap_zero_range() optimizes the partial eof block zeroing use case
> > > > by force zeroing if the mapping is dirty. This is to avoid frequent
> > > > flushing on file extending workloads, which hurts performance.
> > > > 
> > > > Now that the folio batch mechanism provides a more generic solution
> > > > and is used by the only real zero range user (XFS), this isolated
> > > > optimization is no longer needed. Remove the unnecessary code and
> > > > let callers use the folio batch or fall back to flushing by default.
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > 
> > > Heh, I was staring at this last Friday chasing fuse+iomap bugs in
> > > fallocate zerorange and straining to remember what this does.
> > > Is this chunk still needed if the ->iomap_begin implementation doesn't
> > > (or forgets to) grab the folio batch for iomap?
> > > 
> > 
> > No, the hunk removed by this patch is just an optimization. The fallback
> > code here flushes the range if it's dirty and retries the lookup (i.e.
> > picking up unwritten conversions that were pending via dirty pagecache).
> > That flush logic caused a performance regression in a particular
> > workload, so this was introduced to mitigate that regression by just
> > doing the zeroing for the first block or so if the folio is dirty. [1]
> > 
> > The reason for removing it is more just for maintainability. XFS is
> > really the only user here and it is changing over to the more generic
> > batch mechanism, which effectively provides the same optimization, so
> > this basically becomes dead/duplicate code. If an fs doesn't use the
> > batch mechanism it will just fall back to the flush and retry approach,
> > which can be slower but is functionally correct.
> 
> Oh ok thanks for the reminder.
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> 
> > > My bug turned out to be a bug in my fuse+iomap design -- with the way
> > > iomap_zero_range does things, you have to flush+unmap, punch the range
> > > and zero the range.  If you punch and realloc the range and *then* try
> > > to zero the range, the new unwritten extents cause iomap to miss dirty
> > > pages that fuse should've unmapped.  Ooops.
> > > 
> > 
> > I don't quite follow. How do you mean it misses dirty pages?
> 
> Oops, I misspoke, the folios were clean.  Let's say the pagecache is
> sparsely populated with some folios for written space:
> 
> -------fffff-------fffffff
> wwwwwwwwwwwwwwwwwwwwwwwwww
> 
> Now you tell it to go zero range the middle.  fuse's fallocate code
> issues the upcall to userspace, which changes some mappings:
> 
> -------fffff-------fffffff
> wwwwwuuuuuuuuuuuwwwwwwwwww
> 
> Only after the upcall returns does the kernel try to do the pagecache
> zeroing.  Unfortunately, the mapping changed to unwritten so
> iomap_zero_range doesn't see the "fffff" and leaves its contents intact.
> 

Ah, interesting. So presumably the fuse fs is not doing any cache
management, and this creates an unexpected inconsistency between
pagecache and block state.

So what's the solution to this for fuse+iomap? Invalidate the cache
range before or after the callback or something?

Brian

> (Note: Non-iomap fuse defers everything to the fuse server so this isn't
> a problem if the fuse server does all the zeroing itself.)
> 
> --D
> 
> > Brian
> > 
> > [1] Details described in the commit log of fde4c4c3ec1c ("iomap: elide
> > flush from partial eof zero range").
> > 
> > > --D
> > > 
> > > > ---
> > > >  fs/iomap/buffered-io.c | 24 ------------------------
> > > >  1 file changed, 24 deletions(-)
> > > > 
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index 194e3cc0857f..d2bbed692c06 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -1484,33 +1484,9 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > > >  		.private	= private,
> > > >  	};
> > > >  	struct address_space *mapping = inode->i_mapping;
> > > > -	unsigned int blocksize = i_blocksize(inode);
> > > > -	unsigned int off = pos & (blocksize - 1);
> > > > -	loff_t plen = min_t(loff_t, len, blocksize - off);
> > > >  	int ret;
> > > >  	bool range_dirty;
> > > >  
> > > > -	/*
> > > > -	 * Zero range can skip mappings that are zero on disk so long as
> > > > -	 * pagecache is clean. If pagecache was dirty prior to zero range, the
> > > > -	 * mapping converts on writeback completion and so must be zeroed.
> > > > -	 *
> > > > -	 * The simplest way to deal with this across a range is to flush
> > > > -	 * pagecache and process the updated mappings. To avoid excessive
> > > > -	 * flushing on partial eof zeroing, special case it to zero the
> > > > -	 * unaligned start portion if already dirty in pagecache.
> > > > -	 */
> > > > -	if (!iter.fbatch && off &&
> > > > -	    filemap_range_needs_writeback(mapping, pos, pos + plen - 1)) {
> > > > -		iter.len = plen;
> > > > -		while ((ret = iomap_iter(&iter, ops)) > 0)
> > > > -			iter.status = iomap_zero_iter(&iter, did_zero);
> > > > -
> > > > -		iter.len = len - (iter.pos - pos);
> > > > -		if (ret || !iter.len)
> > > > -			return ret;
> > > > -	}
> > > > -
> > > >  	/*
> > > >  	 * To avoid an unconditional flush, check pagecache state and only flush
> > > >  	 * if dirty and the fs returns a mapping that might convert on
> > > > -- 
> > > > 2.50.0
> > > > 
> > > > 
> > > 
> > 
> > 
> 




* Re: [PATCH v3 7/7] xfs: error tag to force zeroing on debug kernels
  2025-07-15 14:30       ` Darrick J. Wong
@ 2025-07-15 16:20         ` Brian Foster
  0 siblings, 0 replies; 37+ messages in thread
From: Brian Foster @ 2025-07-15 16:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy

On Tue, Jul 15, 2025 at 07:30:41AM -0700, Darrick J. Wong wrote:
> On Tue, Jul 15, 2025 at 08:39:03AM -0400, Brian Foster wrote:
> > On Mon, Jul 14, 2025 at 10:24:44PM -0700, Darrick J. Wong wrote:
> > > On Mon, Jul 14, 2025 at 04:41:22PM -0400, Brian Foster wrote:
> > > > iomap_zero_range() has to cover various corner cases that are
> > > > difficult to test on production kernels because it is used in fairly
> > > > limited use cases. For example, it is currently only used by XFS and
> > > > mostly only in partial block zeroing cases.
> > > > 
> > > > While it's possible to test most of these functional cases, we can
> > > > provide more robust test coverage by co-opting fallocate zero range
> > > > to invoke zeroing of the entire range instead of the more efficient
> > > > block punch/allocate sequence. Add an errortag to occasionally
> > > > invoke forced zeroing.
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_errortag.h |  4 +++-
> > > >  fs/xfs/xfs_error.c           |  3 +++
> > > >  fs/xfs/xfs_file.c            | 26 ++++++++++++++++++++------
> > > >  3 files changed, 26 insertions(+), 7 deletions(-)
> > > > 
> > ...
> > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > > index 0b41b18debf3..c865f9555b77 100644
> > > > --- a/fs/xfs/xfs_file.c
> > > > +++ b/fs/xfs/xfs_file.c
> > > > @@ -27,6 +27,8 @@
> > > >  #include "xfs_file.h"
> > > >  #include "xfs_aops.h"
> > > >  #include "xfs_zone_alloc.h"
> > > > +#include "xfs_error.h"
> > > > +#include "xfs_errortag.h"
> > > >  
> > > >  #include <linux/dax.h>
> > > >  #include <linux/falloc.h>
> > > > @@ -1269,13 +1271,25 @@ xfs_falloc_zero_range(
> > > >  	if (error)
> > > >  		return error;
> > > >  
> > > > -	error = xfs_free_file_space(XFS_I(inode), offset, len, ac);
> > > > -	if (error)
> > > > -		return error;
> > > > +	/*
> > > > +	 * Zero range implements a full zeroing mechanism but is only used in
> > > > +	 * limited situations. It is more efficient to allocate unwritten
> > > > +	 * extents than to perform zeroing here, so use an errortag to randomly
> > > > +	 * force zeroing on DEBUG kernels for added test coverage.
> > > > +	 */
> > > > +	if (XFS_TEST_ERROR(false, XFS_I(inode)->i_mount,
> > > > +			   XFS_ERRTAG_FORCE_ZERO_RANGE)) {
> > > > +		error = xfs_zero_range(XFS_I(inode), offset, len, ac, NULL);
> > > 
> > > Isn't this basically the ultra slow fallback version of
> > > FALLOC_FL_WRITE_ZEROES?
> > > 
> > 
> > ~/linux$ git grep FALLOC_FL_WRITE_ZEROES
> > ~/linux$ 
> > 
> > IIRC write zeroes is intended to expose fast hardware (physical) zeroing
> > (i.e. zeroed written extents)..? If so, I suppose you could consider
> > this a fallback of sorts. I'm not sure what write zeroes is expected to
> > do in the unwritten extent case, whereas iomap zero range is happy to
> > skip those mappings unless they're already dirty in pagecache.
> 
> Sorry, forgot that they weren't wiring anything up in xfs so it never
> showed up here:
> https://lore.kernel.org/linux-fsdevel/20250619111806.3546162-1-yi.zhang@huaweicloud.com/
> 
> Basically they want to avoid the unwritten extent conversion overhead by
> providing a way to preallocate written zeroed extents and sending magic
> commands to hardware that unmaps LBAs in such a way that rereads return
> zero.
> 

Ack.. I'd seen that before, but hadn't looked too closely and wasn't
sure what the status was.

> > Regardless, the purpose of this patch is not to add support for physical
> > zeroing, but rather to increase test coverage for the additional code on
> > debug kernels because the production use case tends to be more limited.
> > This could easily be moved/applied to write zeroes if it makes sense in
> > the future and test infra grows support for it.
> 
> <nod> On second look, I don't think the new fallocate flag allows for
> letting the kernel do pagecache zeroing + flush.  Admittedly that would
> be beside the point (and userspaces already do that anyway).
> 

Ok. Thanks for the reviews.

Brian

> Anyway enough mumbling from me,
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> 
> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > +	} else {
> > > > +		error = xfs_free_file_space(XFS_I(inode), offset, len, ac);
> > > > +		if (error)
> > > > +			return error;
> > > >  
> > > > -	len = round_up(offset + len, blksize) - round_down(offset, blksize);
> > > > -	offset = round_down(offset, blksize);
> > > > -	error = xfs_alloc_file_space(XFS_I(inode), offset, len);
> > > > +		len = round_up(offset + len, blksize) -
> > > > +			round_down(offset, blksize);
> > > > +		offset = round_down(offset, blksize);
> > > > +		error = xfs_alloc_file_space(XFS_I(inode), offset, len);
> > > > +	}
> > > >  	if (error)
> > > >  		return error;
> > > >  	return xfs_falloc_setsize(file, new_size);
> > > > -- 
> > > > 2.50.0
> > > > 
> > > > 
> > > 
> > 
> > 
> 




* Re: [PATCH v3 6/7] iomap: remove old partial eof zeroing optimization
  2025-07-15 16:20         ` Brian Foster
@ 2025-07-15 16:30           ` Darrick J. Wong
  0 siblings, 0 replies; 37+ messages in thread
From: Darrick J. Wong @ 2025-07-15 16:30 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy

On Tue, Jul 15, 2025 at 12:20:14PM -0400, Brian Foster wrote:
> On Tue, Jul 15, 2025 at 07:37:33AM -0700, Darrick J. Wong wrote:
> > On Tue, Jul 15, 2025 at 08:36:54AM -0400, Brian Foster wrote:
> > > On Mon, Jul 14, 2025 at 10:34:17PM -0700, Darrick J. Wong wrote:
> > > > On Mon, Jul 14, 2025 at 04:41:21PM -0400, Brian Foster wrote:
> > > > > iomap_zero_range() optimizes the partial eof block zeroing use case
> > > > > by force zeroing if the mapping is dirty. This is to avoid frequent
> > > > > flushing on file extending workloads, which hurts performance.
> > > > > 
> > > > > Now that the folio batch mechanism provides a more generic solution
> > > > > and is used by the only real zero range user (XFS), this isolated
> > > > > optimization is no longer needed. Remove the unnecessary code and
> > > > > let callers use the folio batch or fall back to flushing by default.
> > > > > 
> > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > > 
> > > > Heh, I was staring at this last Friday chasing fuse+iomap bugs in
> > > > fallocate zerorange and straining to remember what this does.
> > > > Is this chunk still needed if the ->iomap_begin implementation doesn't
> > > > (or forgets to) grab the folio batch for iomap?
> > > > 
> > > 
> > > No, the hunk removed by this patch is just an optimization. The fallback
> > > code here flushes the range if it's dirty and retries the lookup (i.e.
> > > picking up unwritten conversions that were pending via dirty pagecache).
> > > That flush logic caused a performance regression in a particular
> > > workload, so this was introduced to mitigate that regression by just
> > > doing the zeroing for the first block or so if the folio is dirty. [1]
> > > 
> > > The reason for removing it is more just for maintainability. XFS is
> > > really the only user here and it is changing over to the more generic
> > > batch mechanism, which effectively provides the same optimization, so
> > > this basically becomes dead/duplicate code. If an fs doesn't use the
> > > batch mechanism it will just fall back to the flush and retry approach,
> > > which can be slower but is functionally correct.
> > 
> > Oh ok thanks for the reminder.
> > Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> > 
> > > > My bug turned out to be a bug in my fuse+iomap design -- with the way
> > > > iomap_zero_range does things, you have to flush+unmap, punch the range
> > > > and zero the range.  If you punch and realloc the range and *then* try
> > > > to zero the range, the new unwritten extents cause iomap to miss dirty
> > > > pages that fuse should've unmapped.  Ooops.
> > > > 
> > > 
> > > I don't quite follow. How do you mean it misses dirty pages?
> > 
> > Oops, I misspoke, the folios were clean.  Let's say the pagecache is
> > sparsely populated with some folios for written space:
> > 
> > -------fffff-------fffffff
> > wwwwwwwwwwwwwwwwwwwwwwwwww
> > 
> > Now you tell it to go zero range the middle.  fuse's fallocate code
> > issues the upcall to userspace, which changes some mappings:
> > 
> > -------fffff-------fffffff
> > wwwwwuuuuuuuuuuuwwwwwwwwww
> > 
> > Only after the upcall returns does the kernel try to do the pagecache
> > zeroing.  Unfortunately, the mapping changed to unwritten so
> > iomap_zero_range doesn't see the "fffff" and leaves its contents intact.
> > 
> 
> Ah, interesting. So presumably the fuse fs is not doing any cache
> management, and this creates an unexpected inconsistency between
> pagecache and block state.
> 
> So what's the solution to this for fuse+iomap? Invalidate the cache
> range before or after the callback or something?

Port xfs_flush_unmap_range, I think.
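
Roughly: write back and then drop the pagecache over the range before the
upcall changes the mapping underneath it. A hypothetical, minimal sketch
(names made up, not actual fuse code), just to show the shape:

/*
 * Hypothetical helper loosely modeled on xfs_flush_unmap_range():
 * flush dirty folios and drop the pagecache over the range so a
 * subsequent mapping change by the server can't leave stale folios
 * sitting over what is now an unwritten extent.
 */
static int fuse_flush_unmap_range(struct inode *inode, loff_t pos, loff_t len)
{
	struct address_space	*mapping = inode->i_mapping;
	loff_t			first = round_down(pos, PAGE_SIZE);
	loff_t			last = round_up(pos + len, PAGE_SIZE) - 1;
	int			error;

	error = filemap_write_and_wait_range(mapping, first, last);
	if (error)
		return error;
	truncate_pagecache_range(inode, first, last);
	return 0;
}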

--D

> Brian
> 
> > (Note: Non-iomap fuse defers everything to the fuse server so this isn't
> > a problem if the fuse server does all the zeroing itself.)
> > 
> > --D
> > 
> > > Brian
> > > 
> > > [1] Details described in the commit log of fde4c4c3ec1c ("iomap: elide
> > > flush from partial eof zero range").
> > > 
> > > > --D
> > > > 
> > > > > ---
> > > > >  fs/iomap/buffered-io.c | 24 ------------------------
> > > > >  1 file changed, 24 deletions(-)
> > > > > 
> > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > > index 194e3cc0857f..d2bbed692c06 100644
> > > > > --- a/fs/iomap/buffered-io.c
> > > > > +++ b/fs/iomap/buffered-io.c
> > > > > @@ -1484,33 +1484,9 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > > > >  		.private	= private,
> > > > >  	};
> > > > >  	struct address_space *mapping = inode->i_mapping;
> > > > > -	unsigned int blocksize = i_blocksize(inode);
> > > > > -	unsigned int off = pos & (blocksize - 1);
> > > > > -	loff_t plen = min_t(loff_t, len, blocksize - off);
> > > > >  	int ret;
> > > > >  	bool range_dirty;
> > > > >  
> > > > > -	/*
> > > > > -	 * Zero range can skip mappings that are zero on disk so long as
> > > > > -	 * pagecache is clean. If pagecache was dirty prior to zero range, the
> > > > > -	 * mapping converts on writeback completion and so must be zeroed.
> > > > > -	 *
> > > > > -	 * The simplest way to deal with this across a range is to flush
> > > > > -	 * pagecache and process the updated mappings. To avoid excessive
> > > > > -	 * flushing on partial eof zeroing, special case it to zero the
> > > > > -	 * unaligned start portion if already dirty in pagecache.
> > > > > -	 */
> > > > > -	if (!iter.fbatch && off &&
> > > > > -	    filemap_range_needs_writeback(mapping, pos, pos + plen - 1)) {
> > > > > -		iter.len = plen;
> > > > > -		while ((ret = iomap_iter(&iter, ops)) > 0)
> > > > > -			iter.status = iomap_zero_iter(&iter, did_zero);
> > > > > -
> > > > > -		iter.len = len - (iter.pos - pos);
> > > > > -		if (ret || !iter.len)
> > > > > -			return ret;
> > > > > -	}
> > > > > -
> > > > >  	/*
> > > > >  	 * To avoid an unconditional flush, check pagecache state and only flush
> > > > >  	 * if dirty and the fs returns a mapping that might convert on
> > > > > -- 
> > > > > 2.50.0
> > > > > 
> > > > > 
> > > > 
> > > 
> > > 
> > 
> 
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing
  2025-07-15  5:22   ` Darrick J. Wong
  2025-07-15 12:35     ` Brian Foster
@ 2025-07-18 11:30     ` Zhang Yi
  2025-07-18 13:48       ` Brian Foster
  1 sibling, 1 reply; 37+ messages in thread
From: Zhang Yi @ 2025-07-18 11:30 UTC (permalink / raw)
  To: Brian Foster
  Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong,
	Ext4 Developers List

On 2025/7/15 13:22, Darrick J. Wong wrote:
> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote:
>> The only way zero range can currently process unwritten mappings
>> with dirty pagecache is to check whether the range is dirty before
>> mapping lookup and then flush when at least one underlying mapping
>> is unwritten. This ordering is required to prevent iomap lookup from
>> racing with folio writeback and reclaim.
>>
>> Since zero range can skip ranges of unwritten mappings that are
>> clean in cache, this operation can be improved by allowing the
>> filesystem to provide a set of dirty folios that require zeroing. In
>> turn, rather than flush or iterate file offsets, zero range can
>> iterate on folios in the batch and advance over clean or uncached
>> ranges in between.
>>
>> Add a folio_batch in struct iomap and provide a helper for fs' to
> 
> /me confused by the single quote; is this supposed to read:
> 
> "...for the fs to populate..."?
> 
> Either way the code changes look like a reasonable thing to do for the
> pagecache (try to grab a bunch of dirty folios while XFS holds the
> mapping lock) so
> 
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> 
> --D
> 
> 
>> populate the batch at lookup time. Update the folio lookup path to
>> return the next folio in the batch, if provided, and advance the
>> iter if the folio starts beyond the current offset.
>>
>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> ---
>>  fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++---
>>  fs/iomap/iter.c        |  6 +++
>>  include/linux/iomap.h  |  4 ++
>>  3 files changed, 94 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index 38da2fa6e6b0..194e3cc0857f 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
[...]
>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>>  	return status;
>>  }
>>  
>> +loff_t
>> +iomap_fill_dirty_folios(
>> +	struct iomap_iter	*iter,
>> +	loff_t			offset,
>> +	loff_t			length)
>> +{
>> +	struct address_space	*mapping = iter->inode->i_mapping;
>> +	pgoff_t			start = offset >> PAGE_SHIFT;
>> +	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;
>> +
>> +	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
>> +	if (!iter->fbatch)

Hi, Brian!

I think ext4 needs to be aware of this failure after it converts to use
iomap infrastructure. It is because if we fail to add dirty folios to the
fbatch, iomap_zero_range() will flush those unwritten and dirty ranges.
This could potentially lead to a deadlock, as most calls to
ext4_block_zero_page_range() occur under an active journal handle.
Writeback operations under an active journal handle may result in circular
waiting within journal transactions. So please return this error code, and
then ext4 can interrupt zero operations to prevent deadlock.
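
To make the request concrete, here is a minimal sketch (not the posted interface) of one way the helper could surface the failure; the errp out parameter is purely hypothetical:

/*
 * Hypothetical variant of the posted helper: report the allocation
 * failure through *errp so the caller can bail out instead of letting
 * iomap_zero_range() fall back to the flush path.
 */
loff_t
iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t offset, loff_t length,
		int *errp)
{
	struct address_space *mapping = iter->inode->i_mapping;
	pgoff_t start = offset >> PAGE_SHIFT;
	pgoff_t end = (offset + length - 1) >> PAGE_SHIFT;

	*errp = 0;
	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
	if (!iter->fbatch) {
		*errp = -ENOMEM;
		return offset + length;
	}
	folio_batch_init(iter->fbatch);

	filemap_get_folios_dirty(mapping, &start, end, iter->fbatch);
	return (start << PAGE_SHIFT);
}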

Thanks,
Yi.

>> +		return offset + length;
>> +	folio_batch_init(iter->fbatch);
>> +
>> +	filemap_get_folios_dirty(mapping, &start, end, iter->fbatch);
>> +	return (start << PAGE_SHIFT);
>> +}
>> +EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios);



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing
  2025-07-18 11:30     ` Zhang Yi
@ 2025-07-18 13:48       ` Brian Foster
  2025-07-19 11:07         ` Zhang Yi
  0 siblings, 1 reply; 37+ messages in thread
From: Brian Foster @ 2025-07-18 13:48 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong,
	Ext4 Developers List

On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote:
> On 2025/7/15 13:22, Darrick J. Wong wrote:
> > On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote:
> >> The only way zero range can currently process unwritten mappings
> >> with dirty pagecache is to check whether the range is dirty before
> >> mapping lookup and then flush when at least one underlying mapping
> >> is unwritten. This ordering is required to prevent iomap lookup from
> >> racing with folio writeback and reclaim.
> >>
> >> Since zero range can skip ranges of unwritten mappings that are
> >> clean in cache, this operation can be improved by allowing the
> >> filesystem to provide a set of dirty folios that require zeroing. In
> >> turn, rather than flush or iterate file offsets, zero range can
> >> iterate on folios in the batch and advance over clean or uncached
> >> ranges in between.
> >>
> >> Add a folio_batch in struct iomap and provide a helper for fs' to
> > 
> > /me confused by the single quote; is this supposed to read:
> > 
> > "...for the fs to populate..."?
> > 
> > Either way the code changes look like a reasonable thing to do for the
> > pagecache (try to grab a bunch of dirty folios while XFS holds the
> > mapping lock) so
> > 
> > Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> > 
> > --D
> > 
> > 
> >> populate the batch at lookup time. Update the folio lookup path to
> >> return the next folio in the batch, if provided, and advance the
> >> iter if the folio starts beyond the current offset.
> >>
> >> Signed-off-by: Brian Foster <bfoster@redhat.com>
> >> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >> ---
> >>  fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++---
> >>  fs/iomap/iter.c        |  6 +++
> >>  include/linux/iomap.h  |  4 ++
> >>  3 files changed, 94 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> >> index 38da2fa6e6b0..194e3cc0857f 100644
> >> --- a/fs/iomap/buffered-io.c
> >> +++ b/fs/iomap/buffered-io.c
> [...]
> >> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> >>  	return status;
> >>  }
> >>  
> >> +loff_t
> >> +iomap_fill_dirty_folios(
> >> +	struct iomap_iter	*iter,
> >> +	loff_t			offset,
> >> +	loff_t			length)
> >> +{
> >> +	struct address_space	*mapping = iter->inode->i_mapping;
> >> +	pgoff_t			start = offset >> PAGE_SHIFT;
> >> +	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;
> >> +
> >> +	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
> >> +	if (!iter->fbatch)
> 
> Hi, Brian!
> 
> I think ext4 needs to be aware of this failure after it converts to use
> iomap infrastructure. It is because if we fail to add dirty folios to the
> fbatch, iomap_zero_range() will flush those unwritten and dirty ranges.
> This could potentially lead to a deadlock, as most calls to
> ext4_block_zero_page_range() occur under an active journal handle.
> Writeback operations under an active journal handle may result in circular
> waiting within journal transactions. So please return this error code, and
> then ext4 can interrupt zero operations to prevent deadlock.
> 

Hi Yi,

Thanks for looking at this.

Huh.. so the reason for falling back like this here is just that this
was considered an optional optimization, with the flush in
iomap_zero_range() being default fallback behavior. IIUC, what you're
saying means that the current zero range behavior without this series is
problematic for ext4-on-iomap..? If so, have you observed issues you can
share details about?

FWIW, I think your suggestion is reasonable, but I'm also curious what
the error handling would look like in ext4. Do you expect to fail
the higher level operation, for example? Cycle locks and retry, etc.?

The reason I ask is because the folio_batch handling has come up through
discussions on this series. My position so far has been to keep it as a
separate allocation and to keep things simple since it is currently
isolated to zero range, but that may change if the usage spills over to
other operations (which seems expected at this point). I suspect that if
a filesystem actually depends on this for correct behavior, that is
another data point worth considering on that topic.

So that has me wondering if it would be better/easier here to perhaps
embed the batch in iomap_iter, or maybe as an incremental step put it on
the stack in iomap_zero_range() and initialize the iomap_iter pointer
there instead of doing the dynamic allocation (then the fill helper
would set a flag to indicate the fs did pagecache lookup). Thoughts on
something like that?
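
For illustration, a rough sketch of that incremental step (the fbatch_filled flag is an invented name and this is not a posted patch; the signature follows the fs-next prototype that takes write_ops):

/*
 * Rough sketch only: iomap_zero_range() owns the folio_batch on the
 * stack, so there is no allocation that can fail. The fill helper would
 * fill iter->fbatch and set a flag (fbatch_filled is a made-up name) to
 * say the fs already did the pagecache lookup, instead of kmalloc()ing
 * the batch.
 */
int
iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
		const struct iomap_ops *ops,
		const struct iomap_write_ops *write_ops, void *private)
{
	struct folio_batch fbatch;
	struct iomap_iter iter = {
		.inode		= inode,
		.pos		= pos,
		.len		= len,
		.flags		= IOMAP_ZERO,
		.private	= private,
		.fbatch		= &fbatch,
	};
	int ret;

	folio_batch_init(&fbatch);

	/*
	 * ... iteration as in this series, except that the "did the fs
	 * provide a batch" checks would test iter.fbatch_filled rather
	 * than iter.fbatch != NULL ...
	 */
	while ((ret = iomap_iter(&iter, ops)) > 0)
		iter.status = iomap_zero_iter(&iter, did_zero, write_ops);

	folio_batch_release(&fbatch);
	return ret;
}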

Also IIUC ext4-on-iomap is still a WIP and review on this series seems
to have mostly wound down. Any objection if the fix for that comes along
as a followup patch rather than a rework of this series?

Brian

P.S., I'm heading on vacation so it will likely be a week or two before
I follow up from here, JFYI.

> Thanks,
> Yi.
> 
> >> +		return offset + length;
> >> +	folio_batch_init(iter->fbatch);
> >> +
> >> +	filemap_get_folios_dirty(mapping, &start, end, iter->fbatch);
> >> +	return (start << PAGE_SHIFT);
> >> +}
> >> +EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios);
> 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing
  2025-07-18 13:48       ` Brian Foster
@ 2025-07-19 11:07         ` Zhang Yi
  2025-07-21  8:47           ` Zhang Yi
  2025-07-30 13:17           ` Brian Foster
  0 siblings, 2 replies; 37+ messages in thread
From: Zhang Yi @ 2025-07-19 11:07 UTC (permalink / raw)
  To: Brian Foster
  Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong,
	Ext4 Developers List

On 2025/7/18 21:48, Brian Foster wrote:
> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote:
>> On 2025/7/15 13:22, Darrick J. Wong wrote:
>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote:
>>>> The only way zero range can currently process unwritten mappings
>>>> with dirty pagecache is to check whether the range is dirty before
>>>> mapping lookup and then flush when at least one underlying mapping
>>>> is unwritten. This ordering is required to prevent iomap lookup from
>>>> racing with folio writeback and reclaim.
>>>>
>>>> Since zero range can skip ranges of unwritten mappings that are
>>>> clean in cache, this operation can be improved by allowing the
>>>> filesystem to provide a set of dirty folios that require zeroing. In
>>>> turn, rather than flush or iterate file offsets, zero range can
>>>> iterate on folios in the batch and advance over clean or uncached
>>>> ranges in between.
>>>>
>>>> Add a folio_batch in struct iomap and provide a helper for fs' to
>>>
>>> /me confused by the single quote; is this supposed to read:
>>>
>>> "...for the fs to populate..."?
>>>
>>> Either way the code changes look like a reasonable thing to do for the
>>> pagecache (try to grab a bunch of dirty folios while XFS holds the
>>> mapping lock) so
>>>
>>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
>>>
>>> --D
>>>
>>>
>>>> populate the batch at lookup time. Update the folio lookup path to
>>>> return the next folio in the batch, if provided, and advance the
>>>> iter if the folio starts beyond the current offset.
>>>>
>>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>> ---
>>>>  fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++---
>>>>  fs/iomap/iter.c        |  6 +++
>>>>  include/linux/iomap.h  |  4 ++
>>>>  3 files changed, 94 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>>>> index 38da2fa6e6b0..194e3cc0857f 100644
>>>> --- a/fs/iomap/buffered-io.c
>>>> +++ b/fs/iomap/buffered-io.c
>> [...]
>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>>>>  	return status;
>>>>  }
>>>>  
>>>> +loff_t
>>>> +iomap_fill_dirty_folios(
>>>> +	struct iomap_iter	*iter,
>>>> +	loff_t			offset,
>>>> +	loff_t			length)
>>>> +{
>>>> +	struct address_space	*mapping = iter->inode->i_mapping;
>>>> +	pgoff_t			start = offset >> PAGE_SHIFT;
>>>> +	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;
>>>> +
>>>> +	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
>>>> +	if (!iter->fbatch)
>>
>> Hi, Brian!
>>
>> I think ext4 needs to be aware of this failure after it converts to use
>> iomap infrastructure. It is because if we fail to add dirty folios to the
>> fbatch, iomap_zero_range() will flush those unwritten and dirty ranges.
>> This could potentially lead to a deadlock, as most calls to
>> ext4_block_zero_page_range() occur under an active journal handle.
>> Writeback operations under an active journal handle may result in circular
>> waiting within journal transactions. So please return this error code, and
>> then ext4 can interrupt zero operations to prevent deadlock.
>>
> 
> Hi Yi,
> 
> Thanks for looking at this.
> 
> Huh.. so the reason for falling back like this here is just that this
> was considered an optional optimization, with the flush in
> iomap_zero_range() being default fallback behavior. IIUC, what you're
> saying means that the current zero range behavior without this series is
> problematic for ext4-on-iomap..? 

Yes.

> If so, have you observed issues you can share details about?

Sure.

Before delving into the specific details of this issue, I would like
to provide some background information on the rule that ext4 cannot
wait for writeback in an active journal handle. If you are aware of
this background, please skip this paragraph. During ext4 writing back
the page cache, it may start a new journal handle to allocate blocks,
update the disksize, and convert unwritten extents after the I/O is
completed. When starting this new journal handle, if the current
running journal transaction is in the process of being submitted or
if the journal space is insufficient, it must wait for the ongoing
transaction to be completed, but the prerequisite for this is that all
currently running handles must be terminated. However, if we flush the
page cache under an active journal handle, we cannot stop it, which
may lead to a deadlock.

Now, the issue I have observed occurs when I attempt to use
iomap_zero_range() within ext4_block_zero_page_range(). My current
implementation is below (based on the latest fs-next).

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 28547663e4fd..1a21667f3f7c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
 	return 0;
 }

+static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset,
+			loff_t length, unsigned int flags, struct iomap *iomap,
+			struct iomap *srcmap)
+{
+	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
+	struct ext4_map_blocks map;
+	u8 blkbits = inode->i_blkbits;
+	int ret;
+
+	ret = ext4_emergency_state(inode->i_sb);
+	if (unlikely(ret))
+		return ret;
+
+	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
+		return -EINVAL;
+
+	/* Calculate the first and last logical blocks respectively. */
+	map.m_lblk = offset >> blkbits;
+	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
+			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
+
+	ret = ext4_map_blocks(NULL, inode, &map, 0);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Look up dirty folios for unwritten mappings within EOF. Providing
+	 * this bypasses the flush iomap uses to trigger extent conversion
+	 * when unwritten mappings have dirty pagecache in need of zeroing.
+	 */
+	if ((map.m_flags & EXT4_MAP_UNWRITTEN) &&
+	    map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) {
+		loff_t end;
+
+		end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits,
+					      map.m_len << blkbits);
+		if ((end >> blkbits) < map.m_lblk + map.m_len)
+			map.m_len = (end >> blkbits) - map.m_lblk;
+	}
+
+	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
+	return 0;
+}
+
+const struct iomap_ops ext4_iomap_buffered_zero_ops = {
+	.iomap_begin = ext4_iomap_buffered_zero_begin,
+};

 const struct iomap_ops ext4_iomap_buffered_write_ops = {
 	.iomap_begin = ext4_iomap_buffered_write_begin,
@@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 	return err;
 }

+static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from,
+					loff_t length)
+{
+	WARN_ON_ONCE(!inode_is_locked(inode) &&
+		     !rwsem_is_locked(&inode->i_mapping->invalidate_lock));
+
+	return iomap_zero_range(inode, from, length, NULL,
+				&ext4_iomap_buffered_zero_ops,
+				&ext4_iomap_write_ops, NULL);
+}
+
 /*
  * ext4_block_zero_page_range() zeros out a mapping of length 'length'
  * starting from file offset 'from'.  The range to be zero'd must
@@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle,
 	if (IS_DAX(inode)) {
 		return dax_zero_range(inode, from, length, NULL,
 				      &ext4_iomap_ops);
+	} else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
+		return ext4_iomap_zero_range(inode, from, length);
 	}
 	return __ext4_block_zero_page_range(handle, mapping, from, length);
 }

The problem is most calls to ext4_block_zero_page_range() occur under
an active journal handle, so I can reproduce the deadlock issue easily
without this series.

> 
> FWIW, I think your suggestion is reasonable, but I'm also curious what
> the error handling would look like in ext4. Do you expect to fail
> the higher level operation, for example? Cycle locks and retry, etc.?

Originally, I wanted ext4_block_zero_page_range() to return a failure
to the higher level operation. However, unfortunately, after my testing
today, I discovered that even if we implement this, this series still
cannot resolve the issue. The corner case is:

Assume we have a dirty folio that covers both hole and unwritten mappings.

   |- dirty folio  -|
   [hhhhhhhhuuuuuuuu]                h:hole, u:unwritten

If we punch the range of the hole, ext4_punch_hole()->
ext4_zero_partial_blocks() will zero out the first half of the dirty folio.
Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio
since the target range is a hole. Finally, iomap_zero_range() will still
flush this whole folio and lead to deadlock during writeback of the latter
half of the folio.

> 
> The reason I ask is because the folio_batch handling has come up through
> discussions on this series. My position so far has been to keep it as a
> separate allocation and to keep things simple since it is currently
> isolated to zero range, but that may change if the usage spills over to
> other operations (which seems expected at this point). I suspect that if
> a filesystem actually depends on this for correct behavior, that is
> another data point worth considering on that topic.
> 
> So that has me wondering if it would be better/easier here to perhaps
> embed the batch in iomap_iter, or maybe as an incremental step put it on
> the stack in iomap_zero_range() and initialize the iomap_iter pointer
> there instead of doing the dynamic allocation (then the fill helper
> would set a flag to indicate the fs did pagecache lookup). Thoughts on
> something like that?
> 
> Also IIUC ext4-on-iomap is still a WIP and review on this series seems
> to have mostly wound down. Any objection if the fix for that comes along
> as a followup patch rather than a rework of this series?

It seems that we don't need to modify this series, we need to consider
other solutions to resolve this deadlock issue.

In my v1 ext4-on-iomap series [1], I resolved this issue by moving all
instances of ext4_block_zero_page_range() out of the running journal
handle(please see patch 19-21). But I don't think this is a good solution
since it's complex and fragile. Besides, after commit c7fc0366c6562
("ext4: partial zero eof block on unaligned inode size extension"), you
added more invocations of ext4_zero_partial_blocks(), and the situation
has become more complicated (Although I think the calls in the three
write_end callbacks can be removed).

Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios
over unwritten mappings before zeroing partial blocks. This is because
ext4 always zeroes the in-memory page cache before zeroing (e.g., in
ext4_setattr() and ext4_punch_hole()), it means if the target range is
still dirty and unwritten when calling ext4_block_zero_page_range(), it
must have already been zeroed. Was I missing something? Therefore, I was
wondering if there are any ways to prevent flushing in
iomap_zero_range()? Any ideas?

[1] https://lore.kernel.org/linux-ext4/20241022111059.2566137-1-yi.zhang@huaweicloud.com/

> 
> Brian
> 
> P.S., I'm heading on vacation so it will likely be a week or two before
> I follow up from here, JFYI.

Wishing you a wonderful time! :-)

Best regards,
Yi.

>>
>>>> +		return offset + length;
>>>> +	folio_batch_init(iter->fbatch);
>>>> +
>>>> +	filemap_get_folios_dirty(mapping, &start, end, iter->fbatch);
>>>> +	return (start << PAGE_SHIFT);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios);
>>



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing
  2025-07-19 11:07         ` Zhang Yi
@ 2025-07-21  8:47           ` Zhang Yi
  2025-07-28 12:57             ` Zhang Yi
  2025-07-30 13:17           ` Brian Foster
  1 sibling, 1 reply; 37+ messages in thread
From: Zhang Yi @ 2025-07-21  8:47 UTC (permalink / raw)
  To: Brian Foster
  Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong,
	Ext4 Developers List

On 2025/7/19 19:07, Zhang Yi wrote:
> On 2025/7/18 21:48, Brian Foster wrote:
>> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote:
>>> On 2025/7/15 13:22, Darrick J. Wong wrote:
>>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote:
>>>>> The only way zero range can currently process unwritten mappings
>>>>> with dirty pagecache is to check whether the range is dirty before
>>>>> mapping lookup and then flush when at least one underlying mapping
>>>>> is unwritten. This ordering is required to prevent iomap lookup from
>>>>> racing with folio writeback and reclaim.
>>>>>
>>>>> Since zero range can skip ranges of unwritten mappings that are
>>>>> clean in cache, this operation can be improved by allowing the
>>>>> filesystem to provide a set of dirty folios that require zeroing. In
>>>>> turn, rather than flush or iterate file offsets, zero range can
>>>>> iterate on folios in the batch and advance over clean or uncached
>>>>> ranges in between.
>>>>>
>>>>> Add a folio_batch in struct iomap and provide a helper for fs' to
>>>>
>>>> /me confused by the single quote; is this supposed to read:
>>>>
>>>> "...for the fs to populate..."?
>>>>
>>>> Either way the code changes look like a reasonable thing to do for the
>>>> pagecache (try to grab a bunch of dirty folios while XFS holds the
>>>> mapping lock) so
>>>>
>>>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
>>>>
>>>> --D
>>>>
>>>>
>>>>> populate the batch at lookup time. Update the folio lookup path to
>>>>> return the next folio in the batch, if provided, and advance the
>>>>> iter if the folio starts beyond the current offset.
>>>>>
>>>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>> ---
>>>>>  fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++---
>>>>>  fs/iomap/iter.c        |  6 +++
>>>>>  include/linux/iomap.h  |  4 ++
>>>>>  3 files changed, 94 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>>>>> index 38da2fa6e6b0..194e3cc0857f 100644
>>>>> --- a/fs/iomap/buffered-io.c
>>>>> +++ b/fs/iomap/buffered-io.c
>>> [...]
>>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>>>>>  	return status;
>>>>>  }
>>>>>  
>>>>> +loff_t
>>>>> +iomap_fill_dirty_folios(
>>>>> +	struct iomap_iter	*iter,
>>>>> +	loff_t			offset,
>>>>> +	loff_t			length)
>>>>> +{
>>>>> +	struct address_space	*mapping = iter->inode->i_mapping;
>>>>> +	pgoff_t			start = offset >> PAGE_SHIFT;
>>>>> +	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;
>>>>> +
>>>>> +	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
>>>>> +	if (!iter->fbatch)
>>>
>>> Hi, Brian!
>>>
>>> I think ext4 needs to be aware of this failure after it converts to use
>>> iomap infrastructure. It is because if we fail to add dirty folios to the
>>> fbatch, iomap_zero_range() will flush those unwritten and dirty ranges.
>>> This could potentially lead to a deadlock, as most calls to
>>> ext4_block_zero_page_range() occur under an active journal handle.
>>> Writeback operations under an active journal handle may result in circular
>>> waiting within journal transactions. So please return this error code, and
>>> then ext4 can interrupt zero operations to prevent deadlock.
>>>
>>
>> Hi Yi,
>>
>> Thanks for looking at this.
>>
>> Huh.. so the reason for falling back like this here is just that this
>> was considered an optional optimization, with the flush in
>> iomap_zero_range() being default fallback behavior. IIUC, what you're
>> saying means that the current zero range behavior without this series is
>> problematic for ext4-on-iomap..? 
> 
> Yes.
> 
>> If so, have you observed issues you can share details about?
> 
> Sure.
> 
> Before delving into the specific details of this issue, I would like
> to provide some background information on the rule that ext4 cannot
> wait for writeback in an active journal handle. If you are aware of
> this background, please skip this paragraph. During ext4 writing back
> the page cache, it may start a new journal handle to allocate blocks,
> update the disksize, and convert unwritten extents after the I/O is
> completed. When starting this new journal handle, if the current
> running journal transaction is in the process of being submitted or
> if the journal space is insufficient, it must wait for the ongoing
> transaction to be completed, but the prerequisite for this is that all
> currently running handles must be terminated. However, if we flush the
> page cache under an active journal handle, we cannot stop it, which
> may lead to a deadlock.
> 
> Now, the issue I have observed occurs when I attempt to use
> iomap_zero_range() within ext4_block_zero_page_range(). My current
> implementation is below (based on the latest fs-next).
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 28547663e4fd..1a21667f3f7c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>  	return 0;
>  }
> 
> +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset,
> +			loff_t length, unsigned int flags, struct iomap *iomap,
> +			struct iomap *srcmap)
> +{
> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
> +	struct ext4_map_blocks map;
> +	u8 blkbits = inode->i_blkbits;
> +	int ret;
> +
> +	ret = ext4_emergency_state(inode->i_sb);
> +	if (unlikely(ret))
> +		return ret;
> +
> +	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> +		return -EINVAL;
> +
> +	/* Calculate the first and last logical blocks respectively. */
> +	map.m_lblk = offset >> blkbits;
> +	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> +			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> +
> +	ret = ext4_map_blocks(NULL, inode, &map, 0);
> +	if (ret < 0)
> +		return ret;
> +
> +	/*
> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
> +	 * this bypasses the flush iomap uses to trigger extent conversion
> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
> +	 */
> +	if ((map.m_flags & EXT4_MAP_UNWRITTEN) &&
> +	    map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) {
> +		loff_t end;
> +
> +		end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits,
> +					      map.m_len << blkbits);
> +		if ((end >> blkbits) < map.m_lblk + map.m_len)
> +			map.m_len = (end >> blkbits) - map.m_lblk;
> +	}
> +
> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> +	return 0;
> +}
> +
> +const struct iomap_ops ext4_iomap_buffered_zero_ops = {
> +	.iomap_begin = ext4_iomap_buffered_zero_begin,
> +};
> 
>  const struct iomap_ops ext4_iomap_buffered_write_ops = {
>  	.iomap_begin = ext4_iomap_buffered_write_begin,
> @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>  	return err;
>  }
> 
> +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from,
> +					loff_t length)
> +{
> +	WARN_ON_ONCE(!inode_is_locked(inode) &&
> +		     !rwsem_is_locked(&inode->i_mapping->invalidate_lock));
> +
> +	return iomap_zero_range(inode, from, length, NULL,
> +				&ext4_iomap_buffered_zero_ops,
> +				&ext4_iomap_write_ops, NULL);
> +}
> +
>  /*
>   * ext4_block_zero_page_range() zeros out a mapping of length 'length'
>   * starting from file offset 'from'.  The range to be zero'd must
> @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle,
>  	if (IS_DAX(inode)) {
>  		return dax_zero_range(inode, from, length, NULL,
>  				      &ext4_iomap_ops);
> +	} else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
> +		return ext4_iomap_zero_range(inode, from, length);
>  	}
>  	return __ext4_block_zero_page_range(handle, mapping, from, length);
>  }
> 
> The problem is most calls to ext4_block_zero_page_range() occur under
> an active journal handle, so I can reproduce the deadlock issue easily
> without this series.
> 
>>
>> FWIW, I think your suggestion is reasonable, but I'm also curious what
>> the error handling would look like in ext4. Do you expect to fail
>> the higher level operation, for example? Cycle locks and retry, etc.?
> 
> Originally, I wanted ext4_block_zero_page_range() to return a failure
> to the higher level operation. However, unfortunately, after my testing
> today, I discovered that even if we implement this, this series still
> cannot resolve the issue. The corner case is:
> 
> Assume we have a dirty folio that covers both hole and unwritten mappings.
> 
>    |- dirty folio  -|
>    [hhhhhhhhuuuuuuuu]                h:hole, u:unwritten
> 
> If we punch the range of the hole, ext4_punch_hole()->
> ext4_zero_partial_blocks() will zero out the first half of the dirty folio.
> Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio
> since the target range is a hole. Finally, iomap_zero_range() will still
> flush this whole folio and lead to deadlock during writeback of the latter
> half of the folio.
> 
>>
>> The reason I ask is because the folio_batch handling has come up through
>> discussions on this series. My position so far has been to keep it as a
>> separate allocation and to keep things simple since it is currently
>> isolated to zero range, but that may change if the usage spills over to
>> other operations (which seems expected at this point). I suspect that if
>> a filesystem actually depends on this for correct behavior, that is
>> another data point worth considering on that topic.
>>
>> So that has me wondering if it would be better/easier here to perhaps
>> embed the batch in iomap_iter, or maybe as an incremental step put it on
>> the stack in iomap_zero_range() and initialize the iomap_iter pointer
>> there instead of doing the dynamic allocation (then the fill helper
>> would set a flag to indicate the fs did pagecache lookup). Thoughts on
>> something like that?
>>
>> Also IIUC ext4-on-iomap is still a WIP and review on this series seems
>> to have mostly wound down. Any objection if the fix for that comes along
>> as a followup patch rather than a rework of this series?
> 
> It seems that we don't need to modify this series, we need to consider
> other solutions to resolve this deadlock issue.
> 
> In my v1 ext4-on-iomap series [1], I resolved this issue by moving all
> instances of ext4_block_zero_page_range() out of the running journal
> handle (please see patches 19-21). But I don't think this is a good solution
> since it's complex and fragile. Besides, after commit c7fc0366c6562
> ("ext4: partial zero eof block on unaligned inode size extension"), you
> added more invocations of ext4_zero_partial_blocks(), and the situation
> has become more complicated (Although I think the calls in the three
> write_end callbacks can be removed).
> 
> Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios
> over unwritten mappings before zeroing partial blocks. This is because
> ext4 always zeroes the in-memory page cache before zeroing (e.g., in
> ext4_setattr() and ext4_punch_hole()), it means if the target range is
> still dirty and unwritten when calling ext4_block_zero_page_range(), it
> must have already been zeroed. Was I missing something? Therefore, I was
> wondering if there are any ways to prevent flushing in
> iomap_zero_range()? Any ideas?
> 

The commit 7d9b474ee4cc ("iomap: make zero range flush conditional on
unwritten mappings") mentioned the following:

  iomap_zero_range() flushes pagecache to mitigate consistency
  problems with dirty pagecache and unwritten mappings. The flush is
  unconditional over the entire range because checking pagecache state
  after mapping lookup is racy with writeback and reclaim. There are
  ways around this using iomap's mapping revalidation mechanism, but
  this is not supported by all iomap based filesystems and so is not a
  generic solution.

Does the revalidation mechanism here refer to verifying the validity of
the mapping through iomap_write_ops->iomap_valid()? IIUC, does this mean
that if the filesystem implements the iomap_valid() interface, we can
always prevent iomap_zero_range() from flushing dirty folios back?
Something like below:

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 73772d34f502..ba71a6ed2f77 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1522,7 +1522,10 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,

			if (range_dirty) {
				range_dirty = false;
-				status = iomap_zero_iter_flush_and_stale(&iter);
+				if (write_ops->iomap_valid)
+					status = iomap_zero_iter(&iter, did_zero, write_ops);
+				else
+					status = iomap_zero_iter_flush_and_stale(&iter);
			} else {
				status = iomap_iter_advance_full(&iter);
			}

Thanks,
Yi.

> [1] https://lore.kernel.org/linux-ext4/20241022111059.2566137-1-yi.zhang@huaweicloud.com/
> 
>>
>> Brian
>>
>> P.S., I'm heading on vacation so it will likely be a week or two before
>> I follow up from here, JFYI.
> 
> Wishing you a wonderful time! :-)
> 
> Best regards,
> Yi.
> 
>>>
>>>>> +		return offset + length;
>>>>> +	folio_batch_init(iter->fbatch);
>>>>> +
>>>>> +	filemap_get_folios_dirty(mapping, &start, end, iter->fbatch);
>>>>> +	return (start << PAGE_SHIFT);
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios);
>>>
> 
> 



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing
  2025-07-21  8:47           ` Zhang Yi
@ 2025-07-28 12:57             ` Zhang Yi
  2025-07-30 13:19               ` Brian Foster
  0 siblings, 1 reply; 37+ messages in thread
From: Zhang Yi @ 2025-07-28 12:57 UTC (permalink / raw)
  To: Brian Foster
  Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong,
	Ext4 Developers List

On 2025/7/21 16:47, Zhang Yi wrote:
> On 2025/7/19 19:07, Zhang Yi wrote:
>> On 2025/7/18 21:48, Brian Foster wrote:
>>> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote:
>>>> On 2025/7/15 13:22, Darrick J. Wong wrote:
>>>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote:
>>>>>> The only way zero range can currently process unwritten mappings
>>>>>> with dirty pagecache is to check whether the range is dirty before
>>>>>> mapping lookup and then flush when at least one underlying mapping
>>>>>> is unwritten. This ordering is required to prevent iomap lookup from
>>>>>> racing with folio writeback and reclaim.
>>>>>>
>>>>>> Since zero range can skip ranges of unwritten mappings that are
>>>>>> clean in cache, this operation can be improved by allowing the
>>>>>> filesystem to provide a set of dirty folios that require zeroing. In
>>>>>> turn, rather than flush or iterate file offsets, zero range can
>>>>>> iterate on folios in the batch and advance over clean or uncached
>>>>>> ranges in between.
>>>>>>
>>>>>> Add a folio_batch in struct iomap and provide a helper for fs' to
>>>>>
>>>>> /me confused by the single quote; is this supposed to read:
>>>>>
>>>>> "...for the fs to populate..."?
>>>>>
>>>>> Either way the code changes look like a reasonable thing to do for the
>>>>> pagecache (try to grab a bunch of dirty folios while XFS holds the
>>>>> mapping lock) so
>>>>>
>>>>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
>>>>>
>>>>> --D
>>>>>
>>>>>
>>>>>> populate the batch at lookup time. Update the folio lookup path to
>>>>>> return the next folio in the batch, if provided, and advance the
>>>>>> iter if the folio starts beyond the current offset.
>>>>>>
>>>>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>>> ---
>>>>>>  fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++---
>>>>>>  fs/iomap/iter.c        |  6 +++
>>>>>>  include/linux/iomap.h  |  4 ++
>>>>>>  3 files changed, 94 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>>>>>> index 38da2fa6e6b0..194e3cc0857f 100644
>>>>>> --- a/fs/iomap/buffered-io.c
>>>>>> +++ b/fs/iomap/buffered-io.c
>>>> [...]
>>>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>>>>>>  	return status;
>>>>>>  }
>>>>>>  
>>>>>> +loff_t
>>>>>> +iomap_fill_dirty_folios(
>>>>>> +	struct iomap_iter	*iter,
>>>>>> +	loff_t			offset,
>>>>>> +	loff_t			length)
>>>>>> +{
>>>>>> +	struct address_space	*mapping = iter->inode->i_mapping;
>>>>>> +	pgoff_t			start = offset >> PAGE_SHIFT;
>>>>>> +	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;
>>>>>> +
>>>>>> +	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
>>>>>> +	if (!iter->fbatch)
>>>>
>>>> Hi, Brian!
>>>>
>>>> I think ext4 needs to be aware of this failure after it converts to use
>>>> iomap infrastructure. It is because if we fail to add dirty folios to the
>>>> fbatch, iomap_zero_range() will flush those unwritten and dirty ranges.
>>>> This could potentially lead to a deadlock, as most calls to
>>>> ext4_block_zero_page_range() occur under an active journal handle.
>>>> Writeback operations under an active journal handle may result in circular
>>>> waiting within journal transactions. So please return this error code, and
>>>> then ext4 can interrupt zero operations to prevent deadlock.
>>>>
>>>
>>> Hi Yi,
>>>
>>> Thanks for looking at this.
>>>
>>> Huh.. so the reason for falling back like this here is just that this
>>> was considered an optional optimization, with the flush in
>>> iomap_zero_range() being default fallback behavior. IIUC, what you're
>>> saying means that the current zero range behavior without this series is
>>> problematic for ext4-on-iomap..? 
>>
>> Yes.
>>
>>> If so, have you observed issues you can share details about?
>>
>> Sure.
>>
>> Before delving into the specific details of this issue, I would like
>> to provide some background information on the rule that ext4 cannot
>> wait for writeback in an active journal handle. If you are aware of
>> this background, please skip this paragraph. During ext4 writing back
>> the page cache, it may start a new journal handle to allocate blocks,
>> update the disksize, and convert unwritten extents after the I/O is
>> completed. When starting this new journal handle, if the current
>> running journal transaction is in the process of being submitted or
>> if the journal space is insufficient, it must wait for the ongoing
>> transaction to be completed, but the prerequisite for this is that all
>> currently running handles must be terminated. However, if we flush the
>> page cache under an active journal handle, we cannot stop it, which
>> may lead to a deadlock.
>>
>> Now, the issue I have observed occurs when I attempt to use
>> iomap_zero_range() within ext4_block_zero_page_range(). My current
>> implementation is below (based on the latest fs-next).
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 28547663e4fd..1a21667f3f7c 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>>  	return 0;
>>  }
>>
>> +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset,
>> +			loff_t length, unsigned int flags, struct iomap *iomap,
>> +			struct iomap *srcmap)
>> +{
>> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
>> +	struct ext4_map_blocks map;
>> +	u8 blkbits = inode->i_blkbits;
>> +	int ret;
>> +
>> +	ret = ext4_emergency_state(inode->i_sb);
>> +	if (unlikely(ret))
>> +		return ret;
>> +
>> +	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
>> +		return -EINVAL;
>> +
>> +	/* Calculate the first and last logical blocks respectively. */
>> +	map.m_lblk = offset >> blkbits;
>> +	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>> +			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
>> +
>> +	ret = ext4_map_blocks(NULL, inode, &map, 0);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	/*
>> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
>> +	 * this bypasses the flush iomap uses to trigger extent conversion
>> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
>> +	 */
>> +	if ((map.m_flags & EXT4_MAP_UNWRITTEN) &&
>> +	    map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) {
>> +		loff_t end;
>> +
>> +		end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits,
>> +					      map.m_len << blkbits);
>> +		if ((end >> blkbits) < map.m_lblk + map.m_len)
>> +			map.m_len = (end >> blkbits) - map.m_lblk;
>> +	}
>> +
>> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>> +	return 0;
>> +}
>> +
>> +const struct iomap_ops ext4_iomap_buffered_zero_ops = {
>> +	.iomap_begin = ext4_iomap_buffered_zero_begin,
>> +};
>>
>>  const struct iomap_ops ext4_iomap_buffered_write_ops = {
>>  	.iomap_begin = ext4_iomap_buffered_write_begin,
>> @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>>  	return err;
>>  }
>>
>> +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from,
>> +					loff_t length)
>> +{
>> +	WARN_ON_ONCE(!inode_is_locked(inode) &&
>> +		     !rwsem_is_locked(&inode->i_mapping->invalidate_lock));
>> +
>> +	return iomap_zero_range(inode, from, length, NULL,
>> +				&ext4_iomap_buffered_zero_ops,
>> +				&ext4_iomap_write_ops, NULL);
>> +}
>> +
>>  /*
>>   * ext4_block_zero_page_range() zeros out a mapping of length 'length'
>>   * starting from file offset 'from'.  The range to be zero'd must
>> @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle,
>>  	if (IS_DAX(inode)) {
>>  		return dax_zero_range(inode, from, length, NULL,
>>  				      &ext4_iomap_ops);
>> +	} else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
>> +		return ext4_iomap_zero_range(inode, from, length);
>>  	}
>>  	return __ext4_block_zero_page_range(handle, mapping, from, length);
>>  }
>>
>> The problem is most calls to ext4_block_zero_page_range() occur under
>> an active journal handle, so I can reproduce the deadlock issue easily
>> without this series.
>>
>>>
>>> FWIW, I think your suggestion is reasonable, but I'm also curious what
>>> the error handling would look like in ext4. Do you expect to fail
>>> the higher level operation, for example? Cycle locks and retry, etc.?
>>
>> Originally, I wanted ext4_block_zero_page_range() to return a failure
>> to the higher level operation. However, unfortunately, after my testing
>> today, I discovered that even if we implement this, this series still
>> cannot resolve the issue. The corner case is:
>>
>> Assume we have a dirty folio that covers both hole and unwritten mappings.
>>
>>    |- dirty folio  -|
>>    [hhhhhhhhuuuuuuuu]                h:hole, u:unwritten
>>
>> If we punch the range of the hole, ext4_punch_hole()->
>> ext4_zero_partial_blocks() will zero out the first half of the dirty folio.
>> Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio
>> since the target range is a hole. Finally, iomap_zero_range() will still
>> flush this whole folio and lead to deadlock during writeback of the latter
>> half of the folio.
>>
>>>
>>> The reason I ask is because the folio_batch handling has come up through
>>> discussions on this series. My position so far has been to keep it as a
>>> separate allocation and to keep things simple since it is currently
>>> isolated to zero range, but that may change if the usage spills over to
>>> other operations (which seems expected at this point). I suspect that if
>>> a filesystem actually depends on this for correct behavior, that is
>>> another data point worth considering on that topic.
>>>
>>> So that has me wondering if it would be better/easier here to perhaps
>>> embed the batch in iomap_iter, or maybe as an incremental step put it on
>>> the stack in iomap_zero_range() and initialize the iomap_iter pointer
>>> there instead of doing the dynamic allocation (then the fill helper
>>> would set a flag to indicate the fs did pagecache lookup). Thoughts on
>>> something like that?
>>>
>>> Also IIUC ext4-on-iomap is still a WIP and review on this series seems
>>> to have mostly wound down. Any objection if the fix for that comes along
>>> as a followup patch rather than a rework of this series?
>>
>> It seems that we don't need to modify this series, we need to consider
>> other solutions to resolve this deadlock issue.
>>
>> In my v1 ext4-on-iomap series [1], I resolved this issue by moving all
>> instances of ext4_block_zero_page_range() out of the running journal
>> handle (please see patches 19-21). But I don't think this is a good solution
>> since it's complex and fragile. Besides, after commit c7fc0366c6562
>> ("ext4: partial zero eof block on unaligned inode size extension"), you
>> added more invocations of ext4_zero_partial_blocks(), and the situation
>> has become more complicated (Although I think the calls in the three
>> write_end callbacks can be removed).
>>
>> Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios
>> over unwritten mappings before zeroing partial blocks. This is because
>> ext4 always zeroes the in-memory page cache before zeroing (e.g., in
>> ext4_setattr() and ext4_punch_hole()), it means if the target range is
>> still dirty and unwritten when calling ext4_block_zero_page_range(), it
>> must have already been zeroed. Was I missing something? Therefore, I was
>> wondering if there are any ways to prevent flushing in
>> iomap_zero_range()? Any ideas?
>>
> 
> The commit 7d9b474ee4cc ("iomap: make zero range flush conditional on
> unwritten mappings") mentioned the following:
> 
>   iomap_zero_range() flushes pagecache to mitigate consistency
>   problems with dirty pagecache and unwritten mappings. The flush is
>   unconditional over the entire range because checking pagecache state
>   after mapping lookup is racy with writeback and reclaim. There are
>   ways around this using iomap's mapping revalidation mechanism, but
>   this is not supported by all iomap based filesystems and so is not a
>   generic solution.
> 
> Does the revalidation mechanism here refer to verifying the validity of
> the mapping through iomap_write_ops->iomap_valid()? IIUC, does this mean
> that if the filesystem implements the iomap_valid() interface, we can
> always prevent iomap_zero_range() from flushing dirty folios back?
> Something like below:
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 73772d34f502..ba71a6ed2f77 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1522,7 +1522,10 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> 
> 			if (range_dirty) {
> 				range_dirty = false;
> -				status = iomap_zero_iter_flush_and_stale(&iter);
> +				if (write_ops->iomap_valid)
> +					status = iomap_zero_iter(&iter, did_zero, write_ops);
> +				else
> +					status = iomap_zero_iter_flush_and_stale(&iter);
> 			} else {
> 				status = iomap_iter_advance_full(&iter);
> 			}
>

The above diff will trigger
WARN_ON_ONCE(folio_pos(folio) > iter->inode->i_size) in iomap_zero_iter()
on XFS. I revised the 'diff' and ran xfstests with several main configs
on both XFS and ext4 (with the iomap infrastructure), and everything seems to
be working fine so far. What do you think?

Thanks,
Yi.


diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 73772d34f502..a75cdb22bab0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1444,9 +1444,6 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 			break;
 		}

-		/* warn about zeroing folios beyond eof that won't write back */
-		WARN_ON_ONCE(folio_pos(folio) > iter->inode->i_size);
-
 		folio_zero_range(folio, offset, bytes);
 		folio_mark_accessed(folio);

@@ -1515,22 +1512,44 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 				 srcmap->type != IOMAP_UNWRITTEN))
 			return -EIO;

-		if (!iter.fbatch &&
-		    (srcmap->type == IOMAP_HOLE ||
-		     srcmap->type == IOMAP_UNWRITTEN)) {
-			s64 status;
+		if (iter.fbatch || (srcmap->type != IOMAP_HOLE &&
+				    srcmap->type != IOMAP_UNWRITTEN)) {
+			iter.status = iomap_zero_iter(&iter, did_zero,
+						      write_ops);
+			continue;
+		}

-			if (range_dirty) {
-				range_dirty = false;
-				status = iomap_zero_iter_flush_and_stale(&iter);
-			} else {
-				status = iomap_iter_advance_full(&iter);
-			}
-			iter.status = status;
+		/*
+		 * No fbatch, and the target is either a hole or an unwritten
+		 * range, skip zeroing if the range is not dirty.
+		 */
+		if (!range_dirty) {
+			iter.status = iomap_iter_advance_full(&iter);
 			continue;
 		}

-		iter.status = iomap_zero_iter(&iter, did_zero, write_ops);
+		/*
+		 * The range is dirty. If the given filesystem does not specify
+		 * a revalidation mechanism, flush the entire range to prevent
+		 * mapping changes that could race with writeback and reclaim.
+		 */
+		if (!write_ops->iomap_valid) {
+			range_dirty = false;
+			iter.status = iomap_zero_iter_flush_and_stale(&iter);
+			continue;
+		}
+
+		/*
+		 * The filesystem specifies an iomap_valid() helper. It is safe
+		 * to zero out the current range if it is unwritten and dirty.
+		 */
+		if (srcmap->type == IOMAP_UNWRITTEN &&
+		    filemap_range_needs_writeback(mapping, iter.pos,
+						  iter.pos + iomap_length(&iter) - 1))
+			iter.status = iomap_zero_iter(&iter, did_zero,
+						      write_ops);
+		else
+			iter.status = iomap_iter_advance_full(&iter);
 	}
 	return ret;
 }




^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing
  2025-07-19 11:07         ` Zhang Yi
  2025-07-21  8:47           ` Zhang Yi
@ 2025-07-30 13:17           ` Brian Foster
  2025-08-02  7:19             ` Zhang Yi
  1 sibling, 1 reply; 37+ messages in thread
From: Brian Foster @ 2025-07-30 13:17 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong,
	Ext4 Developers List

On Sat, Jul 19, 2025 at 07:07:43PM +0800, Zhang Yi wrote:
> On 2025/7/18 21:48, Brian Foster wrote:
> > On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote:
> >> On 2025/7/15 13:22, Darrick J. Wong wrote:
> >>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote:
> >>>> The only way zero range can currently process unwritten mappings
> >>>> with dirty pagecache is to check whether the range is dirty before
> >>>> mapping lookup and then flush when at least one underlying mapping
> >>>> is unwritten. This ordering is required to prevent iomap lookup from
> >>>> racing with folio writeback and reclaim.
> >>>>
> >>>> Since zero range can skip ranges of unwritten mappings that are
> >>>> clean in cache, this operation can be improved by allowing the
> >>>> filesystem to provide a set of dirty folios that require zeroing. In
> >>>> turn, rather than flush or iterate file offsets, zero range can
> >>>> iterate on folios in the batch and advance over clean or uncached
> >>>> ranges in between.
> >>>>
> >>>> Add a folio_batch in struct iomap and provide a helper for fs' to
> >>>
> >>> /me confused by the single quote; is this supposed to read:
> >>>
> >>> "...for the fs to populate..."?
> >>>
> >>> Either way the code changes look like a reasonable thing to do for the
> >>> pagecache (try to grab a bunch of dirty folios while XFS holds the
> >>> mapping lock) so
> >>>
> >>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> >>>
> >>> --D
> >>>
> >>>
> >>>> populate the batch at lookup time. Update the folio lookup path to
> >>>> return the next folio in the batch, if provided, and advance the
> >>>> iter if the folio starts beyond the current offset.
> >>>>
> >>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
> >>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>> ---
> >>>>  fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++---
> >>>>  fs/iomap/iter.c        |  6 +++
> >>>>  include/linux/iomap.h  |  4 ++
> >>>>  3 files changed, 94 insertions(+), 5 deletions(-)
> >>>>
> >>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> >>>> index 38da2fa6e6b0..194e3cc0857f 100644
> >>>> --- a/fs/iomap/buffered-io.c
> >>>> +++ b/fs/iomap/buffered-io.c
> >> [...]
> >>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> >>>>  	return status;
> >>>>  }
> >>>>  
> >>>> +loff_t
> >>>> +iomap_fill_dirty_folios(
> >>>> +	struct iomap_iter	*iter,
> >>>> +	loff_t			offset,
> >>>> +	loff_t			length)
> >>>> +{
> >>>> +	struct address_space	*mapping = iter->inode->i_mapping;
> >>>> +	pgoff_t			start = offset >> PAGE_SHIFT;
> >>>> +	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;
> >>>> +
> >>>> +	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
> >>>> +	if (!iter->fbatch)
> >>
> >> Hi, Brian!
> >>
> >> I think ext4 needs to be aware of this failure after it converts to use
> >> iomap infrastructure. It is because if we fail to add dirty folios to the
> >> fbatch, iomap_zero_range() will flush those unwritten and dirty ranges.
> >> This could potentially lead to a deadlock, as most calls to
> >> ext4_block_zero_page_range() occur under an active journal handle.
> >> Writeback operations under an active journal handle may result in circular
> >> waiting within journal transactions. So please return this error code, and
> >> then ext4 can interrupt zero operations to prevent deadlock.
> >>
> > 
> > Hi Yi,
> > 
> > Thanks for looking at this.
> > 
> > Huh.. so the reason for falling back like this here is just that this
> > was considered an optional optimization, with the flush in
> > iomap_zero_range() being default fallback behavior. IIUC, what you're
> > saying means that the current zero range behavior without this series is
> > problematic for ext4-on-iomap..? 
> 
> Yes.
> 
> > If so, have you observed issues you can share details about?
> 
> Sure.
> 
> Before delving into the specific details of this issue, I would like
> to provide some background information on the rule that ext4 cannot
> wait for writeback in an active journal handle. If you are aware of
> this background, please skip this paragraph. While ext4 is writing back
> the page cache, it may start a new journal handle to allocate blocks,
> update the disksize, and convert unwritten extents after the I/O is
> completed. When starting this new journal handle, if the current
> running journal transaction is in the process of being submitted or
> if the journal space is insufficient, it must wait for the ongoing
> transaction to be completed, but the prerequisite for this is that all
> currently running handles must be terminated. However, if we flush the
> page cache under an active journal handle, we cannot stop it, which
> may lead to a deadlock.
> 

Ok, makes sense.
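
(To restate the ordering being described, purely as an illustrative
call chain and not actual ext4 code:

	handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
	...
	ext4_block_zero_page_range(handle, mapping, from, length)
	  -> iomap_zero_range()			/* dirty + unwritten range */
	    -> filemap_write_and_wait_range()	/* flush under the handle  */
	      -> writeback needs to start a new journal handle
	      -> that handle may have to wait for the running transaction
	         to commit
	      -> the commit waits for the still-active handle above: deadlock

i.e. any path that can reach the flush in iomap_zero_range() while a
handle is held is a problem.)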

> Now, the issue I have observed occurs when I attempt to use
> iomap_zero_range() within ext4_block_zero_page_range(). My current
> implementation is below (based on the latest fs-next).
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 28547663e4fd..1a21667f3f7c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>  	return 0;
>  }
> 
> +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset,
> +			loff_t length, unsigned int flags, struct iomap *iomap,
> +			struct iomap *srcmap)
> +{
> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
> +	struct ext4_map_blocks map;
> +	u8 blkbits = inode->i_blkbits;
> +	int ret;
> +
> +	ret = ext4_emergency_state(inode->i_sb);
> +	if (unlikely(ret))
> +		return ret;
> +
> +	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> +		return -EINVAL;
> +
> +	/* Calculate the first and last logical blocks respectively. */
> +	map.m_lblk = offset >> blkbits;
> +	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> +			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> +
> +	ret = ext4_map_blocks(NULL, inode, &map, 0);
> +	if (ret < 0)
> +		return ret;
> +
> +	/*
> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
> +	 * this bypasses the flush iomap uses to trigger extent conversion
> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
> +	 */
> +	if ((map.m_flags & EXT4_MAP_UNWRITTEN) &&
> +	    map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) {
> +		loff_t end;
> +
> +		end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits,
> +					      map.m_len << blkbits);
> +		if ((end >> blkbits) < map.m_lblk + map.m_len)
> +			map.m_len = (end >> blkbits) - map.m_lblk;
> +	}
> +
> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> +	return 0;
> +}
> +
> +const struct iomap_ops ext4_iomap_buffered_zero_ops = {
> +	.iomap_begin = ext4_iomap_buffered_zero_begin,
> +};
> 
>  const struct iomap_ops ext4_iomap_buffered_write_ops = {
>  	.iomap_begin = ext4_iomap_buffered_write_begin,
> @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>  	return err;
>  }
> 
> +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from,
> +					loff_t length)
> +{
> +	WARN_ON_ONCE(!inode_is_locked(inode) &&
> +		     !rwsem_is_locked(&inode->i_mapping->invalidate_lock));
> +
> +	return iomap_zero_range(inode, from, length, NULL,
> +				&ext4_iomap_buffered_zero_ops,
> +				&ext4_iomap_write_ops, NULL);
> +}
> +
>  /*
>   * ext4_block_zero_page_range() zeros out a mapping of length 'length'
>   * starting from file offset 'from'.  The range to be zero'd must
> @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle,
>  	if (IS_DAX(inode)) {
>  		return dax_zero_range(inode, from, length, NULL,
>  				      &ext4_iomap_ops);
> +	} else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
> +		return ext4_iomap_zero_range(inode, from, length);
>  	}
>  	return __ext4_block_zero_page_range(handle, mapping, from, length);
>  }
> 
> The problem is most calls to ext4_block_zero_page_range() occur under
> an active journal handle, so I can reproduce the deadlock issue easily
> without this series.
> 
> > 
> > FWIW, I think your suggestion is reasonable, but I'm also curious what
> > the error handling would look like in ext4. Do you expect to fail
> > the higher level operation, for example? Cycle locks and retry, etc.?
> 
> Originally, I wanted ext4_block_zero_page_range() to return a failure
> to the higher level operation. However, unfortunately, after my testing
> today, I discovered that even though we implement this, this series still
> cannot resolve the issue. The corner case is:
> 
> Assume we have a dirty folio that covers both hole and unwritten mappings.
> 
>    |- dirty folio  -|
>    [hhhhhhhhuuuuuuuu]                h:hole, u:unwritten
> 
> If we punch the range of the hole, ext4_punch_hole()->
> ext4_zero_partial_blocks() will zero out the first half of the dirty folio.
> Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio
> since the target range is a hole. Finally, iomap_zero_range() will still
> flush this whole folio and lead to a deadlock during writeback of the
> latter half of the folio.
> 

Hmm.. Ok. So it seems there are at least a couple ways around this
particular quirk. I suspect one is that you could just call the fill
helper in the hole case as well, but that's kind of a hack and not
really intended use.

The other way goes back to the fact that the flush for the hole case was
kind of a corner case hack in the first place. The original comment for
that seems to have been dropped, but see commit 7d9b474ee4cc ("iomap:
make zero range flush conditional on unwritten mappings") for reference
to the original intent.

I'd have to go back and investigate if something regresses with that
taken out, but my recollection is that was something that needed proper
fixing eventually anyways. I'm particularly wondering if that is no
longer an issue now that pagecache_isize_extended() handles the post-eof
zeroing (the caveat being we might just need to call it in some
additional size extension cases besides just setattr/truncate).
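
(For illustration only, "taken out" would roughly mean restricting the
flush-and-stale fallback in iomap_zero_range() to unwritten mappings,
something along these lines:

		if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) {
			range_dirty = false;
			status = iomap_zero_iter_flush_and_stale(&iter);
		} else {
			status = iomap_iter_advance_full(&iter);
		}

so that a dirty folio over a hole mapping no longer triggers a flush.
That's a sketch of the idea, not a tested change.)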

> > 
> > The reason I ask is because the folio_batch handling has come up through
> > discussions on this series. My position so far has been to keep it as a
> > separate allocation and to keep things simple since it is currently
> > isolated to zero range, but that may change if the usage spills over to
> > other operations (which seems expected at this point). I suspect that if
> > a filesystem actually depends on this for correct behavior, that is
> > another data point worth considering on that topic.
> > 
> > So that has me wondering if it would be better/easier here to perhaps
> > embed the batch in iomap_iter, or maybe as an incremental step put it on
> > the stack in iomap_zero_range() and initialize the iomap_iter pointer
> > there instead of doing the dynamic allocation (then the fill helper
> > would set a flag to indicate the fs did pagecache lookup). Thoughts on
> > something like that?
> > 
> > Also IIUC ext4-on-iomap is still a WIP and review on this series seems
> > to have mostly wound down. Any objection if the fix for that comes along
> > as a followup patch rather than a rework of this series?
> 
> It seems that we don't need to modify this series, we need to consider
> other solutions to resolve this deadlock issue.
> 
> In my v1 ext4-on-iomap series [1], I resolved this issue by moving all
> instances of ext4_block_zero_page_range() out of the running journal
> handle (please see patches 19-21). But I don't think this is a good solution
> since it's complex and fragile. Besides, after commit c7fc0366c6562
> ("ext4: partial zero eof block on unaligned inode size extension"), you
> added more invocations of ext4_zero_partial_blocks(), and the situation
> has become more complicated (Although I think the calls in the three
> write_end callbacks can be removed).
> 
> Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios
> over unwritten mappings before zeroing partial blocks. This is because
> ext4 always zeroes the in-memory page cache before zeroing (e.g., in
> ext4_setattr() and ext4_punch_hole()), it means if the target range is
> still dirty and unwritten when calling ext4_block_zero_page_range(), it
> must have already been zeroed. Was I missing something? Therefore, I was
> wondering if there are any ways to prevent flushing in
> iomap_zero_range()? Any ideas?

It's certainly possible that the quirk fixed by the flush in the hole case
was never a problem on ext4, if that's what you mean. Most of the
testing for this was on XFS since ext4 hadn't used iomap for buffered
writes.

At the end of the day, the batch mechanism is intended to facilitate
avoiding the flush entirely. I'm still paging things back in here.. but
if we had two smallish changes to this code path to 1. eliminate the
dynamic folio_batch allocation and 2. drop the flush on hole mapping
case, would that address the issues with iomap zero range for ext4?
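
(To be concrete about 1., the rough idea would be to keep the batch on
the iomap_zero_range() stack instead of kmalloc()ing it in the fill
helper. Sketch only, details and naming TBD:

	int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, ...)
	{
		struct folio_batch fbatch;
		struct iomap_iter iter = {
			.inode	= inode,
			.pos	= pos,
			.len	= len,
			.flags	= IOMAP_ZERO,
		};

		folio_batch_init(&fbatch);
		iter.fbatch = &fbatch;
		...
	}

with iomap_fill_dirty_folios() then just filling the caller-provided
batch and setting a flag to indicate that the fs did the pagecache
lookup, so there is no allocation failure path for the fs to handle.)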

> 
> [1] https://lore.kernel.org/linux-ext4/20241022111059.2566137-1-yi.zhang@huaweicloud.com/
> 
> > 
> > Brian
> > 
> > P.S., I'm heading on vacation so it will likely be a week or two before
> > I follow up from here, JFYI.
> 
> Wishing you a wonderful time! :-)

Thanks!

Brian

> 
> Best regards,
> Yi.
> 
> >>
> >>>> +		return offset + length;
> >>>> +	folio_batch_init(iter->fbatch);
> >>>> +
> >>>> +	filemap_get_folios_dirty(mapping, &start, end, iter->fbatch);
> >>>> +	return (start << PAGE_SHIFT);
> >>>> +}
> >>>> +EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios);
> >>
> 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing
  2025-07-28 12:57             ` Zhang Yi
@ 2025-07-30 13:19               ` Brian Foster
  2025-08-02  7:26                 ` Zhang Yi
  0 siblings, 1 reply; 37+ messages in thread
From: Brian Foster @ 2025-07-30 13:19 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong,
	Ext4 Developers List

On Mon, Jul 28, 2025 at 08:57:28PM +0800, Zhang Yi wrote:
> On 2025/7/21 16:47, Zhang Yi wrote:
> > On 2025/7/19 19:07, Zhang Yi wrote:
> >> On 2025/7/18 21:48, Brian Foster wrote:
> >>> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote:
> >>>> On 2025/7/15 13:22, Darrick J. Wong wrote:
> >>>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote:
> >>>>>> The only way zero range can currently process unwritten mappings
> >>>>>> with dirty pagecache is to check whether the range is dirty before
> >>>>>> mapping lookup and then flush when at least one underlying mapping
> >>>>>> is unwritten. This ordering is required to prevent iomap lookup from
> >>>>>> racing with folio writeback and reclaim.
> >>>>>>
> >>>>>> Since zero range can skip ranges of unwritten mappings that are
> >>>>>> clean in cache, this operation can be improved by allowing the
> >>>>>> filesystem to provide a set of dirty folios that require zeroing. In
> >>>>>> turn, rather than flush or iterate file offsets, zero range can
> >>>>>> iterate on folios in the batch and advance over clean or uncached
> >>>>>> ranges in between.
> >>>>>>
> >>>>>> Add a folio_batch in struct iomap and provide a helper for fs' to
> >>>>>
> >>>>> /me confused by the single quote; is this supposed to read:
> >>>>>
> >>>>> "...for the fs to populate..."?
> >>>>>
> >>>>> Either way the code changes look like a reasonable thing to do for the
> >>>>> pagecache (try to grab a bunch of dirty folios while XFS holds the
> >>>>> mapping lock) so
> >>>>>
> >>>>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> >>>>>
> >>>>> --D
> >>>>>
> >>>>>
> >>>>>> populate the batch at lookup time. Update the folio lookup path to
> >>>>>> return the next folio in the batch, if provided, and advance the
> >>>>>> iter if the folio starts beyond the current offset.
> >>>>>>
> >>>>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
> >>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>>>> ---
> >>>>>>  fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++---
> >>>>>>  fs/iomap/iter.c        |  6 +++
> >>>>>>  include/linux/iomap.h  |  4 ++
> >>>>>>  3 files changed, 94 insertions(+), 5 deletions(-)
> >>>>>>
> >>>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> >>>>>> index 38da2fa6e6b0..194e3cc0857f 100644
> >>>>>> --- a/fs/iomap/buffered-io.c
> >>>>>> +++ b/fs/iomap/buffered-io.c
> >>>> [...]
> >>>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> >>>>>>  	return status;
> >>>>>>  }
> >>>>>>  
> >>>>>> +loff_t
> >>>>>> +iomap_fill_dirty_folios(
> >>>>>> +	struct iomap_iter	*iter,
> >>>>>> +	loff_t			offset,
> >>>>>> +	loff_t			length)
> >>>>>> +{
> >>>>>> +	struct address_space	*mapping = iter->inode->i_mapping;
> >>>>>> +	pgoff_t			start = offset >> PAGE_SHIFT;
> >>>>>> +	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;
> >>>>>> +
> >>>>>> +	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
> >>>>>> +	if (!iter->fbatch)
> >>>>
> >>>> Hi, Brian!
> >>>>
> >>>> I think ext4 needs to be aware of this failure after it converts to use
> >>>> iomap infrastructure. It is because if we fail to add dirty folios to the
> >>>> fbatch, iomap_zero_range() will flush those unwritten and dirty range.
> >>>> This could potentially lead to a deadlock, as most calls to
> >>>> ext4_block_zero_page_range() occur under an active journal handle.
> >>>> Writeback operations under an active journal handle may result in circular
> >>>> waiting within journal transactions. So please return this error code, and
> >>>> then ext4 can interrupt zero operations to prevent deadlock.
> >>>>
> >>>
> >>> Hi Yi,
> >>>
> >>> Thanks for looking at this.
> >>>
> >>> Huh.. so the reason for falling back like this here is just that this
> >>> was considered an optional optimization, with the flush in
> >>> iomap_zero_range() being default fallback behavior. IIUC, what you're
> >>> saying means that the current zero range behavior without this series is
> >>> problematic for ext4-on-iomap..? 
> >>
> >> Yes.
> >>
> >>> If so, have you observed issues you can share details about?
> >>
> >> Sure.
> >>
> >> Before delving into the specific details of this issue, I would like
> >> to provide some background information on the rule that ext4 cannot
> >> wait for writeback in an active journal handle. If you are aware of
> >> this background, please skip this paragraph. While ext4 is writing back
> >> the page cache, it may start a new journal handle to allocate blocks,
> >> update the disksize, and convert unwritten extents after the I/O is
> >> completed. When starting this new journal handle, if the current
> >> running journal transaction is in the process of being submitted or
> >> if the journal space is insufficient, it must wait for the ongoing
> >> transaction to be completed, but the prerequisite for this is that all
> >> currently running handles must be terminated. However, if we flush the
> >> page cache under an active journal handle, we cannot stop it, which
> >> may lead to a deadlock.
> >>
> >> Now, the issue I have observed occurs when I attempt to use
> >> iomap_zero_range() within ext4_block_zero_page_range(). My current
> >> implementation is below (based on the latest fs-next).
> >>
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index 28547663e4fd..1a21667f3f7c 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> >>  	return 0;
> >>  }
> >>
> >> +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset,
> >> +			loff_t length, unsigned int flags, struct iomap *iomap,
> >> +			struct iomap *srcmap)
> >> +{
> >> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
> >> +	struct ext4_map_blocks map;
> >> +	u8 blkbits = inode->i_blkbits;
> >> +	int ret;
> >> +
> >> +	ret = ext4_emergency_state(inode->i_sb);
> >> +	if (unlikely(ret))
> >> +		return ret;
> >> +
> >> +	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> >> +		return -EINVAL;
> >> +
> >> +	/* Calculate the first and last logical blocks respectively. */
> >> +	map.m_lblk = offset >> blkbits;
> >> +	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> >> +			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> >> +
> >> +	ret = ext4_map_blocks(NULL, inode, &map, 0);
> >> +	if (ret < 0)
> >> +		return ret;
> >> +
> >> +	/*
> >> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
> >> +	 * this bypasses the flush iomap uses to trigger extent conversion
> >> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
> >> +	 */
> >> +	if ((map.m_flags & EXT4_MAP_UNWRITTEN) &&
> >> +	    map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) {
> >> +		loff_t end;
> >> +
> >> +		end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits,
> >> +					      map.m_len << blkbits);
> >> +		if ((end >> blkbits) < map.m_lblk + map.m_len)
> >> +			map.m_len = (end >> blkbits) - map.m_lblk;
> >> +	}
> >> +
> >> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> >> +	return 0;
> >> +}
> >> +
> >> +const struct iomap_ops ext4_iomap_buffered_zero_ops = {
> >> +	.iomap_begin = ext4_iomap_buffered_zero_begin,
> >> +};
> >>
> >>  const struct iomap_ops ext4_iomap_buffered_write_ops = {
> >>  	.iomap_begin = ext4_iomap_buffered_write_begin,
> >> @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle,
> >>  	return err;
> >>  }
> >>
> >> +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from,
> >> +					loff_t length)
> >> +{
> >> +	WARN_ON_ONCE(!inode_is_locked(inode) &&
> >> +		     !rwsem_is_locked(&inode->i_mapping->invalidate_lock));
> >> +
> >> +	return iomap_zero_range(inode, from, length, NULL,
> >> +				&ext4_iomap_buffered_zero_ops,
> >> +				&ext4_iomap_write_ops, NULL);
> >> +}
> >> +
> >>  /*
> >>   * ext4_block_zero_page_range() zeros out a mapping of length 'length'
> >>   * starting from file offset 'from'.  The range to be zero'd must
> >> @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle,
> >>  	if (IS_DAX(inode)) {
> >>  		return dax_zero_range(inode, from, length, NULL,
> >>  				      &ext4_iomap_ops);
> >> +	} else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
> >> +		return ext4_iomap_zero_range(inode, from, length);
> >>  	}
> >>  	return __ext4_block_zero_page_range(handle, mapping, from, length);
> >>  }
> >>
> >> The problem is most calls to ext4_block_zero_page_range() occur under
> >> an active journal handle, so I can reproduce the deadlock issue easily
> >> without this series.
> >>
> >>>
> >>> FWIW, I think your suggestion is reasonable, but I'm also curious what
> >>> the error handling would look like in ext4. Do you expect to fail
> >>> the higher level operation, for example? Cycle locks and retry, etc.?
> >>
> >> Originally, I wanted ext4_block_zero_page_range() to return a failure
> >> to the higher level operation. However, unfortunately, after my testing
> >> today, I discovered that even though we implement this, this series still
> >> cannot resolve the issue. The corner case is:
> >>
> >> Assume we have a dirty folio that covers both hole and unwritten mappings.
> >>
> >>    |- dirty folio  -|
> >>    [hhhhhhhhuuuuuuuu]                h:hole, u:unwritten
> >>
> >> If we punch the range of the hole, ext4_punch_hole()->
> >> ext4_zero_partial_blocks() will zero out the first half of the dirty folio.
> >> Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio
> >> since the target range is a hole. Finally, iomap_zero_range() will still
> >> flush this whole folio and lead to a deadlock during writeback of the
> >> latter half of the folio.
> >>
> >>>
> >>> The reason I ask is because the folio_batch handling has come up through
> >>> discussions on this series. My position so far has been to keep it as a
> >>> separate allocation and to keep things simple since it is currently
> >>> isolated to zero range, but that may change if the usage spills over to
> >>> other operations (which seems expected at this point). I suspect that if
> >>> a filesystem actually depends on this for correct behavior, that is
> >>> another data point worth considering on that topic.
> >>>
> >>> So that has me wondering if it would be better/easier here to perhaps
> >>> embed the batch in iomap_iter, or maybe as an incremental step put it on
> >>> the stack in iomap_zero_range() and initialize the iomap_iter pointer
> >>> there instead of doing the dynamic allocation (then the fill helper
> >>> would set a flag to indicate the fs did pagecache lookup). Thoughts on
> >>> something like that?
> >>>
> >>> Also IIUC ext4-on-iomap is still a WIP and review on this series seems
> >>> to have mostly wound down. Any objection if the fix for that comes along
> >>> as a followup patch rather than a rework of this series?
> >>
> >> It seems that we don't need to modify this series, we need to consider
> >> other solutions to resolve this deadlock issue.
> >>
> >> In my v1 ext4-on-iomap series [1], I resolved this issue by moving all
> >> instances of ext4_block_zero_page_range() out of the running journal
> >> handle (please see patches 19-21). But I don't think this is a good solution
> >> since it's complex and fragile. Besides, after commit c7fc0366c6562
> >> ("ext4: partial zero eof block on unaligned inode size extension"), you
> >> added more invocations of ext4_zero_partial_blocks(), and the situation
> >> has become more complicated (Although I think the calls in the three
> >> write_end callbacks can be removed).
> >>
> >> Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios
> >> over unwritten mappings before zeroing partial blocks. This is because
> >> ext4 always zeroes the in-memory page cache before zeroing (e.g., in
> >> ext4_setattr() and ext4_punch_hole()), it means if the target range is
> >> still dirty and unwritten when calling ext4_block_zero_page_range(), it
> >> must have already been zeroed. Was I missing something? Therefore, I was
> >> wondering if there are any ways to prevent flushing in
> >> iomap_zero_range()? Any ideas?
> >>
> > 
> > The commit 7d9b474ee4cc ("iomap: make zero range flush conditional on
> > unwritten mappings") mentioned the following:
> > 
> >   iomap_zero_range() flushes pagecache to mitigate consistency
> >   problems with dirty pagecache and unwritten mappings. The flush is
> >   unconditional over the entire range because checking pagecache state
> >   after mapping lookup is racy with writeback and reclaim. There are
> >   ways around this using iomap's mapping revalidation mechanism, but
> >   this is not supported by all iomap based filesystems and so is not a
> >   generic solution.
> > 
> > Does the revalidation mechanism here refer to verifying the validity of
> > the mapping through iomap_write_ops->iomap_valid()? IIUC, does this mean
> > that if the filesystem implements the iomap_valid() interface, we can
> > always prevent iomap_zero_range() from flushing dirty folios back?
> > Something like below:
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 73772d34f502..ba71a6ed2f77 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -1522,7 +1522,10 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > 
> > 			if (range_dirty) {
> > 				range_dirty = false;
> > -				status = iomap_zero_iter_flush_and_stale(&iter);
> > +				if (write_ops->iomap_valid)
> > +					status = iomap_zero_iter(&iter, did_zero, write_ops);
> > +				else
> > +					status = iomap_zero_iter_flush_and_stale(&iter);
> > 			} else {
> > 				status = iomap_iter_advance_full(&iter);
> > 			}
> >
> 
> The above diff will trigger
> WARN_ON_ONCE(folio_pos(folio) > iter->inode->i_size) in iomap_zero_iter()
> on XFS. I revised the 'diff' and ran xfstests with several main configs
> on both XFS and ext4 (with iomap infrastructure), and everything seems to
> be working fine so far. What do you think?
> 

A couple things here.. First, I don't think it's quite enough to assume
zeroing is safe just because a revalidation callback is defined. I have
multiple (old) prototypes around fixing zero range, one of which was
centered around using ->iomap_valid(), so I do believe it's technically
possible with further changes. That's what the above "There are ways
around this using the revalidation mechanism" text was referring to
generally.

That said, I don't think going down that path is the right solution
here. I opted for the folio batch approach in favor of that one because
it is more generic, for one. Based on a lot of the discussion around
this series, it also seems to be more broadly useful. For example, there
is potential to use for other operations, including buffered writes. If
that pans out (no guarantees of course), then the fill thing becomes
more of a generic iomap step vs. something called by the fs. That likely
means the flush this is trying to work around can also go away entirely.

I.e., the flush is really just a fallback mechanism for fs' that don't
care or know any better, since there is no guarantee the callback fills
the batch and a flush is better than silent data corruption. So I'd
really like to avoid trying to reinvent things here if at all possible.
I'm curious if the tweaks proposed in the previous reply would be
sufficient for ext4. If so, I can dig into that after rolling the next
version of this series..

Brian

> Thanks,
> Yi.
> 
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 73772d34f502..a75cdb22bab0 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1444,9 +1444,6 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
>  			break;
>  		}
> 
> -		/* warn about zeroing folios beyond eof that won't write back */
> -		WARN_ON_ONCE(folio_pos(folio) > iter->inode->i_size);
> -
>  		folio_zero_range(folio, offset, bytes);
>  		folio_mark_accessed(folio);
> 
> @@ -1515,22 +1512,44 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>  				 srcmap->type != IOMAP_UNWRITTEN))
>  			return -EIO;
> 
> -		if (!iter.fbatch &&
> -		    (srcmap->type == IOMAP_HOLE ||
> -		     srcmap->type == IOMAP_UNWRITTEN)) {
> -			s64 status;
> +		if (iter.fbatch || (srcmap->type != IOMAP_HOLE &&
> +				    srcmap->type != IOMAP_UNWRITTEN)) {
> +			iter.status = iomap_zero_iter(&iter, did_zero,
> +						      write_ops);
> +			continue;
> +		}
> 
> -			if (range_dirty) {
> -				range_dirty = false;
> -				status = iomap_zero_iter_flush_and_stale(&iter);
> -			} else {
> -				status = iomap_iter_advance_full(&iter);
> -			}
> -			iter.status = status;
> +		/*
> +		 * No fbatch, and the target is either a hole or an unwritten
> +		 * range, skip zeroing if the range is not dirty.
> +		 */
> +		if (!range_dirty) {
> +			iter.status = iomap_iter_advance_full(&iter);
>  			continue;
>  		}
> 
> -		iter.status = iomap_zero_iter(&iter, did_zero, write_ops);
> +		/*
> +		 * The range is dirty. If the given filesystem does not specify
> +		 * a revalidation mechanism, flush the entire range to prevent
> +		 * mapping changes that could race with writeback and reclaim.
> +		 */
> +		if (!write_ops->iomap_valid) {
> +			range_dirty = false;
> +			iter.status = iomap_zero_iter_flush_and_stale(&iter);
> +			continue;
> +		}
> +
> +		/*
> +		 * The filesystem specifies an iomap_valid() helper. It is safe
> +		 * to zero out the current range if it is unwritten and dirty.
> +		 */
> +		if (srcmap->type == IOMAP_UNWRITTEN &&
> +		    filemap_range_needs_writeback(mapping, iter.pos,
> +						  iomap_length(&iter)))
> +			iter.status = iomap_zero_iter(&iter, did_zero,
> +						      write_ops);
> +		else
> +			iter.status = iomap_iter_advance_full(&iter);
>  	}
>  	return ret;
>  }
> 
> 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing
  2025-07-30 13:17           ` Brian Foster
@ 2025-08-02  7:19             ` Zhang Yi
  2025-08-05 13:08               ` Brian Foster
  0 siblings, 1 reply; 37+ messages in thread
From: Zhang Yi @ 2025-08-02  7:19 UTC (permalink / raw)
  To: Brian Foster
  Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong,
	Ext4 Developers List

On 2025/7/30 21:17, Brian Foster wrote:
> On Sat, Jul 19, 2025 at 07:07:43PM +0800, Zhang Yi wrote:
>> On 2025/7/18 21:48, Brian Foster wrote:
>>> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote:
>>>> On 2025/7/15 13:22, Darrick J. Wong wrote:
>>>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote:
>>>>>> The only way zero range can currently process unwritten mappings
>>>>>> with dirty pagecache is to check whether the range is dirty before
>>>>>> mapping lookup and then flush when at least one underlying mapping
>>>>>> is unwritten. This ordering is required to prevent iomap lookup from
>>>>>> racing with folio writeback and reclaim.
>>>>>>
>>>>>> Since zero range can skip ranges of unwritten mappings that are
>>>>>> clean in cache, this operation can be improved by allowing the
>>>>>> filesystem to provide a set of dirty folios that require zeroing. In
>>>>>> turn, rather than flush or iterate file offsets, zero range can
>>>>>> iterate on folios in the batch and advance over clean or uncached
>>>>>> ranges in between.
>>>>>>
>>>>>> Add a folio_batch in struct iomap and provide a helper for fs' to
>>>>>
>>>>> /me confused by the single quote; is this supposed to read:
>>>>>
>>>>> "...for the fs to populate..."?
>>>>>
>>>>> Either way the code changes look like a reasonable thing to do for the
>>>>> pagecache (try to grab a bunch of dirty folios while XFS holds the
>>>>> mapping lock) so
>>>>>
>>>>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
>>>>>
>>>>> --D
>>>>>
>>>>>
>>>>>> populate the batch at lookup time. Update the folio lookup path to
>>>>>> return the next folio in the batch, if provided, and advance the
>>>>>> iter if the folio starts beyond the current offset.
>>>>>>
>>>>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>>> ---
>>>>>>  fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++---
>>>>>>  fs/iomap/iter.c        |  6 +++
>>>>>>  include/linux/iomap.h  |  4 ++
>>>>>>  3 files changed, 94 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>>>>>> index 38da2fa6e6b0..194e3cc0857f 100644
>>>>>> --- a/fs/iomap/buffered-io.c
>>>>>> +++ b/fs/iomap/buffered-io.c
>>>> [...]
>>>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>>>>>>  	return status;
>>>>>>  }
>>>>>>  
>>>>>> +loff_t
>>>>>> +iomap_fill_dirty_folios(
>>>>>> +	struct iomap_iter	*iter,
>>>>>> +	loff_t			offset,
>>>>>> +	loff_t			length)
>>>>>> +{
>>>>>> +	struct address_space	*mapping = iter->inode->i_mapping;
>>>>>> +	pgoff_t			start = offset >> PAGE_SHIFT;
>>>>>> +	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;
>>>>>> +
>>>>>> +	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
>>>>>> +	if (!iter->fbatch)
>>>>
>>>> Hi, Brian!
>>>>
>>>> I think ext4 needs to be aware of this failure after it converts to use
>>>> iomap infrastructure. It is because if we fail to add dirty folios to the
>>>> fbatch, iomap_zero_range() will flush those unwritten and dirty range.
>>>> This could potentially lead to a deadlock, as most calls to
>>>> ext4_block_zero_page_range() occur under an active journal handle.
>>>> Writeback operations under an active journal handle may result in circular
>>>> waiting within journal transactions. So please return this error code, and
>>>> then ext4 can interrupt zero operations to prevent deadlock.
>>>>
>>>
>>> Hi Yi,
>>>
>>> Thanks for looking at this.
>>>
>>> Huh.. so the reason for falling back like this here is just that this
>>> was considered an optional optimization, with the flush in
>>> iomap_zero_range() being default fallback behavior. IIUC, what you're
>>> saying means that the current zero range behavior without this series is
>>> problematic for ext4-on-iomap..? 
>>
>> Yes.
>>
>>> If so, have you observed issues you can share details about?
>>
>> Sure.
>>
>> Before delving into the specific details of this issue, I would like
>> to provide some background information on the rule that ext4 cannot
>> wait for writeback in an active journal handle. If you are aware of
>> this background, please skip this paragraph. While ext4 is writing back
>> the page cache, it may start a new journal handle to allocate blocks,
>> update the disksize, and convert unwritten extents after the I/O is
>> completed. When starting this new journal handle, if the current
>> running journal transaction is in the process of being submitted or
>> if the journal space is insufficient, it must wait for the ongoing
>> transaction to be completed, but the prerequisite for this is that all
>> currently running handles must be terminated. However, if we flush the
>> page cache under an active journal handle, we cannot stop it, which
>> may lead to a deadlock.
>>
> 
> Ok, makes sense.
> 
>> Now, the issue I have observed occurs when I attempt to use
>> iomap_zero_range() within ext4_block_zero_page_range(). My current
>> implementation is below (based on the latest fs-next).
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 28547663e4fd..1a21667f3f7c 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>>  	return 0;
>>  }
>>
>> +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset,
>> +			loff_t length, unsigned int flags, struct iomap *iomap,
>> +			struct iomap *srcmap)
>> +{
>> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
>> +	struct ext4_map_blocks map;
>> +	u8 blkbits = inode->i_blkbits;
>> +	int ret;
>> +
>> +	ret = ext4_emergency_state(inode->i_sb);
>> +	if (unlikely(ret))
>> +		return ret;
>> +
>> +	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
>> +		return -EINVAL;
>> +
>> +	/* Calculate the first and last logical blocks respectively. */
>> +	map.m_lblk = offset >> blkbits;
>> +	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>> +			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
>> +
>> +	ret = ext4_map_blocks(NULL, inode, &map, 0);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	/*
>> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
>> +	 * this bypasses the flush iomap uses to trigger extent conversion
>> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
>> +	 */
>> +	if ((map.m_flags & EXT4_MAP_UNWRITTEN) &&
>> +	    map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) {
>> +		loff_t end;
>> +
>> +		end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits,
>> +					      map.m_len << blkbits);
>> +		if ((end >> blkbits) < map.m_lblk + map.m_len)
>> +			map.m_len = (end >> blkbits) - map.m_lblk;
>> +	}
>> +
>> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>> +	return 0;
>> +}
>> +
>> +const struct iomap_ops ext4_iomap_buffered_zero_ops = {
>> +	.iomap_begin = ext4_iomap_buffered_zero_begin,
>> +};
>>
>>  const struct iomap_ops ext4_iomap_buffered_write_ops = {
>>  	.iomap_begin = ext4_iomap_buffered_write_begin,
>> @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>>  	return err;
>>  }
>>
>> +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from,
>> +					loff_t length)
>> +{
>> +	WARN_ON_ONCE(!inode_is_locked(inode) &&
>> +		     !rwsem_is_locked(&inode->i_mapping->invalidate_lock));
>> +
>> +	return iomap_zero_range(inode, from, length, NULL,
>> +				&ext4_iomap_buffered_zero_ops,
>> +				&ext4_iomap_write_ops, NULL);
>> +}
>> +
>>  /*
>>   * ext4_block_zero_page_range() zeros out a mapping of length 'length'
>>   * starting from file offset 'from'.  The range to be zero'd must
>> @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle,
>>  	if (IS_DAX(inode)) {
>>  		return dax_zero_range(inode, from, length, NULL,
>>  				      &ext4_iomap_ops);
>> +	} else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
>> +		return ext4_iomap_zero_range(inode, from, length);
>>  	}
>>  	return __ext4_block_zero_page_range(handle, mapping, from, length);
>>  }
>>
>> The problem is most calls to ext4_block_zero_page_range() occur under
>> an active journal handle, so I can reproduce the deadlock issue easily
>> without this series.
>>
>>>
>>> FWIW, I think your suggestion is reasonable, but I'm also curious what
>>> the error handling would look like in ext4. Do you expect to fail
>>> the higher level operation, for example? Cycle locks and retry, etc.?
>>
>> Originally, I wanted ext4_block_zero_page_range() to return a failure
>> to the higher level operation. However, unfortunately, after my testing
>> today, I discovered that even though we implement this, this series still
>> cannot resolve the issue. The corner case is:
>>
>> Assume we have a dirty folio that covers both hole and unwritten mappings.
>>
>>    |- dirty folio  -|
>>    [hhhhhhhhuuuuuuuu]                h:hole, u:unwritten
>>
>> If we punch the range of the hole, ext4_punch_hole()->
>> ext4_zero_partial_blocks() will zero out the first half of the dirty folio.
>> Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio
>> since the target range is a hole. Finally, iomap_zero_range() will still
>> flush this whole folio and lead to a deadlock during writeback of the
>> latter half of the folio.
>>
> 
> Hmm.. Ok. So it seems there are at least a couple ways around this
> particular quirk. I suspect one is that you could just call the fill
> helper in the hole case as well, but that's kind of a hack and not
> really intended use.
> 
> The other way goes back to the fact that the flush for the hole case was
> kind of a corner case hack in the first place. The original comment for
> that seems to have been dropped, but see commit 7d9b474ee4cc ("iomap:
> make zero range flush conditional on unwritten mappings") for reference
> to the original intent.
> 
> I'd have to go back and investigate if something regresses with that
> taken out, but my recollection is that was something that needed proper
> fixing eventually anyways. I'm particularly wondering if that is no
> longer an issue now that pagecache_isize_extended() handles the post-eof
> zeroing (the caveat being we might just need to call it in some
> additional size extension cases besides just setattr/truncate).

Yeah, I agree with you. I suppose the post-EOF partial folio zeroing in
pagecache_isize_extended() should work.

> 
>>>
>>> The reason I ask is because the folio_batch handling has come up through
>>> discussions on this series. My position so far has been to keep it as a
>>> separate allocation and to keep things simple since it is currently
>>> isolated to zero range, but that may change if the usage spills over to
>>> other operations (which seems expected at this point). I suspect that if
>>> a filesystem actually depends on this for correct behavior, that is
>>> another data point worth considering on that topic.
>>>
>>> So that has me wondering if it would be better/easier here to perhaps
>>> embed the batch in iomap_iter, or maybe as an incremental step put it on
>>> the stack in iomap_zero_range() and initialize the iomap_iter pointer
>>> there instead of doing the dynamic allocation (then the fill helper
>>> would set a flag to indicate the fs did pagecache lookup). Thoughts on
>>> something like that?
>>>
>>> Also IIUC ext4-on-iomap is still a WIP and review on this series seems
>>> to have mostly wound down. Any objection if the fix for that comes along
>>> as a followup patch rather than a rework of this series?
>>
>> It seems that we don't need to modify this series, we need to consider
>> other solutions to resolve this deadlock issue.
>>
>> In my v1 ext4-on-iomap series [1], I resolved this issue by moving all
>> instances of ext4_block_zero_page_range() out of the running journal
>> handle (please see patches 19-21). But I don't think this is a good solution
>> since it's complex and fragile. Besides, after commit c7fc0366c6562
>> ("ext4: partial zero eof block on unaligned inode size extension"), you
>> added more invocations of ext4_zero_partial_blocks(), and the situation
>> has become more complicated (Although I think the calls in the three
>> write_end callbacks can be removed).
>>
>> Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios
>> over unwritten mappings before zeroing partial blocks. This is because
>> ext4 always zeroes the in-memory page cache before zeroing (e.g., in
>> ext4_setattr() and ext4_punch_hole()), it means if the target range is
>> still dirty and unwritten when calling ext4_block_zero_page_range(), it
>> must have already been zeroed. Was I missing something? Therefore, I was
>> wondering if there are any ways to prevent flushing in
>> iomap_zero_range()? Any ideas?
> 
> It's certainly possible that the quirk fixed by the flush in the hole case
> was never a problem on ext4, if that's what you mean. Most of the
> testing for this was on XFS since ext4 hadn't used iomap for buffered
> writes.
> 
> At the end of the day, the batch mechanism is intended to facilitate
> avoiding the flush entirely. I'm still paging things back in here.. but
> if we had two smallish changes to this code path to 1. eliminate the
> dynamic folio_batch allocation and 2. drop the flush on hole mapping
> case, would that address the issues with iomap zero range for ext4?
> 

Thank you for looking at this!

I made a simple modification to the iomap_zero_range() function based
on the second solution you mentioned, then tested it using kvm-xfstests
these days. This solution works fine on ext4 and I haven't found any
other risks so far. (Since my testing environment has sufficient memory, I
have not yet handled the case of memory allocation failure).

--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1520,7 +1520,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
		     srcmap->type == IOMAP_UNWRITTEN)) {
			s64 status;

-			if (range_dirty) {
+			if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) {
				range_dirty = false;
				status = iomap_zero_iter_flush_and_stale(&iter);
			} else {

Another thing I want to mention (although there are no real issues at
the moment, I still want to mention it) is that there appears to be
no consistency guarantee between the lookup of the mapping and the
folio_batch. For example, assume we have a file which contains two
dirty folios and two unwritten extents, where one folio corresponds to
one extent. We zero out these two folios.

    | dirty folio 1  || dirty folio 2   |
    [uuuuuuuuuuuuuuuu][uuuuuuuuuuuuuuuuu]

In the first call to ->iomap_begin(), we get the unwritten extent 1.
At the same time, another thread writes back folio 1 and clears this
folio, so this folio will not be added to the folio_batch. Then
iomap_zero_range() will still flush those two folios. When flushing
the second folio, there is still a risk of deadlock due to changes in
metadata.

However, since ext4 currently uses this interface only to zero out
partial blocks, this situation will not happen, but if the usage
changes in the future, we should be very careful about this point.
So in the future, I hope to have a more reliable method to avoid
flushing in iomap_zero_range().

Therefore, at the moment, I think that solving the problem through
these two points is feasible (I hope I haven't missed anything :-) ),
though it is somewhat fragile. What do other ext4 developers think?

Thanks,
Yi.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing
  2025-07-30 13:19               ` Brian Foster
@ 2025-08-02  7:26                 ` Zhang Yi
  0 siblings, 0 replies; 37+ messages in thread
From: Zhang Yi @ 2025-08-02  7:26 UTC (permalink / raw)
  To: Brian Foster
  Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong,
	Ext4 Developers List

On 2025/7/30 21:19, Brian Foster wrote:
> On Mon, Jul 28, 2025 at 08:57:28PM +0800, Zhang Yi wrote:
>> On 2025/7/21 16:47, Zhang Yi wrote:
>>> On 2025/7/19 19:07, Zhang Yi wrote:
>>>> On 2025/7/18 21:48, Brian Foster wrote:
>>>>> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote:
>>>>>> On 2025/7/15 13:22, Darrick J. Wong wrote:
>>>>>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote:
>>>>>>>> The only way zero range can currently process unwritten mappings
>>>>>>>> with dirty pagecache is to check whether the range is dirty before
>>>>>>>> mapping lookup and then flush when at least one underlying mapping
>>>>>>>> is unwritten. This ordering is required to prevent iomap lookup from
>>>>>>>> racing with folio writeback and reclaim.
>>>>>>>>
>>>>>>>> Since zero range can skip ranges of unwritten mappings that are
>>>>>>>> clean in cache, this operation can be improved by allowing the
>>>>>>>> filesystem to provide a set of dirty folios that require zeroing. In
>>>>>>>> turn, rather than flush or iterate file offsets, zero range can
>>>>>>>> iterate on folios in the batch and advance over clean or uncached
>>>>>>>> ranges in between.
>>>>>>>>
>>>>>>>> Add a folio_batch in struct iomap and provide a helper for fs' to
>>>>>>>
>>>>>>> /me confused by the single quote; is this supposed to read:
>>>>>>>
>>>>>>> "...for the fs to populate..."?
>>>>>>>
>>>>>>> Either way the code changes look like a reasonable thing to do for the
>>>>>>> pagecache (try to grab a bunch of dirty folios while XFS holds the
>>>>>>> mapping lock) so
>>>>>>>
>>>>>>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
>>>>>>>
>>>>>>> --D
>>>>>>>
>>>>>>>
>>>>>>>> populate the batch at lookup time. Update the folio lookup path to
>>>>>>>> return the next folio in the batch, if provided, and advance the
>>>>>>>> iter if the folio starts beyond the current offset.
>>>>>>>>
>>>>>>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>>>>> ---
>>>>>>>>  fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++---
>>>>>>>>  fs/iomap/iter.c        |  6 +++
>>>>>>>>  include/linux/iomap.h  |  4 ++
>>>>>>>>  3 files changed, 94 insertions(+), 5 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>>>>>>>> index 38da2fa6e6b0..194e3cc0857f 100644
>>>>>>>> --- a/fs/iomap/buffered-io.c
>>>>>>>> +++ b/fs/iomap/buffered-io.c
>>>>>> [...]
>>>>>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>>>>>>>>  	return status;
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> +loff_t
>>>>>>>> +iomap_fill_dirty_folios(
>>>>>>>> +	struct iomap_iter	*iter,
>>>>>>>> +	loff_t			offset,
>>>>>>>> +	loff_t			length)
>>>>>>>> +{
>>>>>>>> +	struct address_space	*mapping = iter->inode->i_mapping;
>>>>>>>> +	pgoff_t			start = offset >> PAGE_SHIFT;
>>>>>>>> +	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;
>>>>>>>> +
>>>>>>>> +	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
>>>>>>>> +	if (!iter->fbatch)
>>>>>>
>>>>>> Hi, Brian!
>>>>>>
>>>>>> I think ext4 needs to be aware of this failure after it converts to use
>>>>>> iomap infrastructure. It is because if we fail to add dirty folios to the
>>>>>> fbatch, iomap_zero_range() will flush those unwritten and dirty range.
>>>>>> This could potentially lead to a deadlock, as most calls to
>>>>>> ext4_block_zero_page_range() occur under an active journal handle.
>>>>>> Writeback operations under an active journal handle may result in circular
>>>>>> waiting within journal transactions. So please return this error code, and
>>>>>> then ext4 can interrupt zero operations to prevent deadlock.
>>>>>>
>>>>>
>>>>> Hi Yi,
>>>>>
>>>>> Thanks for looking at this.
>>>>>
>>>>> Huh.. so the reason for falling back like this here is just that this
>>>>> was considered an optional optimization, with the flush in
>>>>> iomap_zero_range() being default fallback behavior. IIUC, what you're
>>>>> saying means that the current zero range behavior without this series is
>>>>> problematic for ext4-on-iomap..? 
>>>>
>>>> Yes.
>>>>
>>>>> If so, have you observed issues you can share details about?
>>>>
>>>> Sure.
>>>>
>>>> Before delving into the specific details of this issue, I would like
>>>> to provide some background information on the rule that ext4 cannot
>>>> wait for writeback in an active journal handle. If you are aware of
>>>> this background, please skip this paragraph. While ext4 is writing back
>>>> the page cache, it may start a new journal handle to allocate blocks,
>>>> update the disksize, and convert unwritten extents after the I/O is
>>>> completed. When starting this new journal handle, if the current
>>>> running journal transaction is in the process of being submitted or
>>>> if the journal space is insufficient, it must wait for the ongoing
>>>> transaction to be completed, but the prerequisite for this is that all
>>>> currently running handles must be terminated. However, if we flush the
>>>> page cache under an active journal handle, we cannot stop it, which
>>>> may lead to a deadlock.
>>>>
>>>> Now, the issue I have observed occurs when I attempt to use
>>>> iomap_zero_range() within ext4_block_zero_page_range(). My current
>>>> implementation are below(based on the latest fs-next).
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index 28547663e4fd..1a21667f3f7c 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>>>>  	return 0;
>>>>  }
>>>>
>>>> +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset,
>>>> +			loff_t length, unsigned int flags, struct iomap *iomap,
>>>> +			struct iomap *srcmap)
>>>> +{
>>>> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
>>>> +	struct ext4_map_blocks map;
>>>> +	u8 blkbits = inode->i_blkbits;
>>>> +	int ret;
>>>> +
>>>> +	ret = ext4_emergency_state(inode->i_sb);
>>>> +	if (unlikely(ret))
>>>> +		return ret;
>>>> +
>>>> +	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
>>>> +		return -EINVAL;
>>>> +
>>>> +	/* Calculate the first and last logical blocks respectively. */
>>>> +	map.m_lblk = offset >> blkbits;
>>>> +	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>>>> +			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
>>>> +
>>>> +	ret = ext4_map_blocks(NULL, inode, &map, 0);
>>>> +	if (ret < 0)
>>>> +		return ret;
>>>> +
>>>> +	/*
>>>> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
>>>> +	 * this bypasses the flush iomap uses to trigger extent conversion
>>>> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
>>>> +	 */
>>>> +	if ((map.m_flags & EXT4_MAP_UNWRITTEN) &&
>>>> +	    map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) {
>>>> +		loff_t end;
>>>> +
>>>> +		end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits,
>>>> +					      map.m_len << blkbits);
>>>> +		if ((end >> blkbits) < map.m_lblk + map.m_len)
>>>> +			map.m_len = (end >> blkbits) - map.m_lblk;
>>>> +	}
>>>> +
>>>> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +const struct iomap_ops ext4_iomap_buffered_zero_ops = {
>>>> +	.iomap_begin = ext4_iomap_buffered_zero_begin,
>>>> +};
>>>>
>>>>  const struct iomap_ops ext4_iomap_buffered_write_ops = {
>>>>  	.iomap_begin = ext4_iomap_buffered_write_begin,
>>>> @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>>>>  	return err;
>>>>  }
>>>>
>>>> +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from,
>>>> +					loff_t length)
>>>> +{
>>>> +	WARN_ON_ONCE(!inode_is_locked(inode) &&
>>>> +		     !rwsem_is_locked(&inode->i_mapping->invalidate_lock));
>>>> +
>>>> +	return iomap_zero_range(inode, from, length, NULL,
>>>> +				&ext4_iomap_buffered_zero_ops,
>>>> +				&ext4_iomap_write_ops, NULL);
>>>> +}
>>>> +
>>>>  /*
>>>>   * ext4_block_zero_page_range() zeros out a mapping of length 'length'
>>>>   * starting from file offset 'from'.  The range to be zero'd must
>>>> @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle,
>>>>  	if (IS_DAX(inode)) {
>>>>  		return dax_zero_range(inode, from, length, NULL,
>>>>  				      &ext4_iomap_ops);
>>>> +	} else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
>>>> +		return ext4_iomap_zero_range(inode, from, length);
>>>>  	}
>>>>  	return __ext4_block_zero_page_range(handle, mapping, from, length);
>>>>  }
>>>>
>>>> The problem is most calls to ext4_block_zero_page_range() occur under
>>>> an active journal handle, so I can reproduce the deadlock issue easily
>>>> without this series.
>>>>
>>>>>
>>>>> FWIW, I think your suggestion is reasonable, but I'm also curious what
>>>>> the error handling would look like in ext4. Do you expect to the fail
>>>>> the higher level operation, for example? Cycle locks and retry, etc.?
>>>>
>>>> Originally, I wanted ext4_block_zero_page_range() to return a failure
>>>> to the higher level operation. However, unfortunately, after my testing
>>>> today, I discovered that even though we implement this, this series still
>>>> cannot resolve the issue. The corner case is:
>>>>
>>>> Assume we have a dirty folio that covers both hole and unwritten mappings.
>>>>
>>>>    |- dirty folio  -|
>>>>    [hhhhhhhhuuuuuuuu]                h:hole, u:unwritten
>>>>
>>>> If we punch the range of the hole, ext4_punch_hole()->
>>>> ext4_zero_partial_blocks() will zero out the first half of the dirty folio.
>>>> Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio
>>>> since the target range is a hole. Finally, iomap_zero_range() will still
>>>> flush this whole folio and lead to a deadlock during writeback of the
>>>> latter half of the folio.
>>>>
>>>>>
>>>>> The reason I ask is because the folio_batch handling has come up through
>>>>> discussions on this series. My position so far has been to keep it as a
>>>>> separate allocation and to keep things simple since it is currently
>>>>> isolated to zero range, but that may change if the usage spills over to
>>>>> other operations (which seems expected at this point). I suspect that if
>>>>> a filesystem actually depends on this for correct behavior, that is
>>>>> another data point worth considering on that topic.
>>>>>
>>>>> So that has me wondering if it would be better/easier here to perhaps
>>>>> embed the batch in iomap_iter, or maybe as an incremental step put it on
>>>>> the stack in iomap_zero_range() and initialize the iomap_iter pointer
>>>>> there instead of doing the dynamic allocation (then the fill helper
>>>>> would set a flag to indicate the fs did pagecache lookup). Thoughts on
>>>>> something like that?
>>>>>
>>>>> Also IIUC ext4-on-iomap is still a WIP and review on this series seems
>>>>> to have mostly wound down. Any objection if the fix for that comes along
>>>>> as a followup patch rather than a rework of this series?
>>>>
>>>> It seems that we don't need to modify this series, we need to consider
>>>> other solutions to resolve this deadlock issue.
>>>>
>>>> In my v1 ext4-on-iomap series [1], I resolved this issue by moving all
>>>> instances of ext4_block_zero_page_range() out of the running journal
>>>> handle (please see patches 19-21). But I don't think this is a good solution
>>>> since it's complex and fragile. Besides, after commit c7fc0366c6562
>>>> ("ext4: partial zero eof block on unaligned inode size extension"), you
>>>> added more invocations of ext4_zero_partial_blocks(), and the situation
>>>> has become more complicated (Although I think the calls in the three
>>>> write_end callbacks can be removed).
>>>>
>>>> Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios
>>>> over unwritten mappings before zeroing partial blocks. This is because
>>>> ext4 always zeroes the in-memory page cache before zeroing (e.g., in
>>>> ext4_setattr() and ext4_punch_hole()), it means if the target range is
>>>> still dirty and unwritten when calling ext4_block_zero_page_range(), it
>>>> must have already been zeroed. Was I missing something? Therefore, I was
>>>> wondering if there are any ways to prevent flushing in
>>>> iomap_zero_range()? Any ideas?
>>>>
>>>
>>> The commit 7d9b474ee4cc ("iomap: make zero range flush conditional on
>>> unwritten mappings") mentioned the following:
>>>
>>>   iomap_zero_range() flushes pagecache to mitigate consistency
>>>   problems with dirty pagecache and unwritten mappings. The flush is
>>>   unconditional over the entire range because checking pagecache state
>>>   after mapping lookup is racy with writeback and reclaim. There are
>>>   ways around this using iomap's mapping revalidation mechanism, but
>>>   this is not supported by all iomap based filesystems and so is not a
>>>   generic solution.
>>>
>>> Does the revalidation mechanism here refer to verifying the validity of
>>> the mapping through iomap_write_ops->iomap_valid()? IIUC, does this mean
>>> that if the filesystem implements the iomap_valid() interface, we can
>>> always prevent iomap_zero_range() from flushing dirty folios back?
>>> Something like below:
>>>
>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>>> index 73772d34f502..ba71a6ed2f77 100644
>>> --- a/fs/iomap/buffered-io.c
>>> +++ b/fs/iomap/buffered-io.c
>>> @@ -1522,7 +1522,10 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>>>
>>> 			if (range_dirty) {
>>> 				range_dirty = false;
>>> -				status = iomap_zero_iter_flush_and_stale(&iter);
>>> +				if (write_ops->iomap_valid)
>>> +					status = iomap_zero_iter(&iter, did_zero, write_ops);
>>> +				else
>>> +					status = iomap_zero_iter_flush_and_stale(&iter);
>>> 			} else {
>>> 				status = iomap_iter_advance_full(&iter);
>>> 			}
>>>
>>
>> The above diff will trigger
>> WARN_ON_ONCE(folio_pos(folio) > iter->inode->i_size) in iomap_zero_iter()
>> on XFS. I revised the 'diff' and ran xfstests with several main configs
>> on both XFS and ext4(with iomap infrastructure), and everything seems to
>> be working fine so far. What do you think?
>>
> 
> A couple things here.. First, I don't think it's quite enough to assume
> zeroing is safe just because a revalidation callback is defined.

Sorry, I can't think of any problems with this. Could you please provide
a more detailed explanation along with an example?

> I have
> multiple (old) prototypes around fixing zero range, one of which was
> centered around using ->iomap_valid(), so I do believe it's technically
> possible with further changes. That's what the above "There are ways
> around this using the revalidation mechanism" text was referring to
> generally.
> 
> That said, I don't think going down that path is the right solution
> here. I opted for the folio batch approach in favor of that one because
> it is more generic, for one. Based on a lot of the discussion around
> this series, it also seems to be more broadly useful. For example, there
> is potential to use for other operations, including buffered writes.

Hmm, I didn't follow your discussion so I don't quite get this. How would
buffered writes utilize this folio batch mechanism, and what are the
benefits? Could you please give more information?

> If
> that pans out (no guarantees of course), then the fill thing becomes
> more of a generic iomap step vs. something called by the fs. That likely
> means the flush this is trying to work around can also go away entirely.

It would be great if we could completely avoid this flush. I look
forward to seeing it.

> I.e., the flush is really just a fallback mechanism for fs' that don't
> care or know any better, since there is no guarantee the callback fills
> the batch and a flush is better than silent data corruption. So I'd
> really like to avoid trying to reinvent things here if at all possible.

Yes, this makes sense to me.

> I'm curious if the tweaks proposed in the previous reply would be
> sufficient for ext4. If so, I can dig into that after rolling the next
> version of this series..
> 

I believe it is working now. Please see my reply for details.

Thanks,
Yi.




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing
  2025-08-02  7:19             ` Zhang Yi
@ 2025-08-05 13:08               ` Brian Foster
  2025-08-06  3:10                 ` Zhang Yi
  0 siblings, 1 reply; 37+ messages in thread
From: Brian Foster @ 2025-08-05 13:08 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong,
	Ext4 Developers List

On Sat, Aug 02, 2025 at 03:19:54PM +0800, Zhang Yi wrote:
> On 2025/7/30 21:17, Brian Foster wrote:
> > On Sat, Jul 19, 2025 at 07:07:43PM +0800, Zhang Yi wrote:
> >> On 2025/7/18 21:48, Brian Foster wrote:
> >>> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote:
> >>>> On 2025/7/15 13:22, Darrick J. Wong wrote:
> >>>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote:
> >>>>>> The only way zero range can currently process unwritten mappings
> >>>>>> with dirty pagecache is to check whether the range is dirty before
> >>>>>> mapping lookup and then flush when at least one underlying mapping
> >>>>>> is unwritten. This ordering is required to prevent iomap lookup from
> >>>>>> racing with folio writeback and reclaim.
> >>>>>>
> >>>>>> Since zero range can skip ranges of unwritten mappings that are
> >>>>>> clean in cache, this operation can be improved by allowing the
> >>>>>> filesystem to provide a set of dirty folios that require zeroing. In
> >>>>>> turn, rather than flush or iterate file offsets, zero range can
> >>>>>> iterate on folios in the batch and advance over clean or uncached
> >>>>>> ranges in between.
> >>>>>>
> >>>>>> Add a folio_batch in struct iomap and provide a helper for fs' to
> >>>>>
> >>>>> /me confused by the single quote; is this supposed to read:
> >>>>>
> >>>>> "...for the fs to populate..."?
> >>>>>
> >>>>> Either way the code changes look like a reasonable thing to do for the
> >>>>> pagecache (try to grab a bunch of dirty folios while XFS holds the
> >>>>> mapping lock) so
> >>>>>
> >>>>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> >>>>>
> >>>>> --D
> >>>>>
> >>>>>
> >>>>>> populate the batch at lookup time. Update the folio lookup path to
> >>>>>> return the next folio in the batch, if provided, and advance the
> >>>>>> iter if the folio starts beyond the current offset.
> >>>>>>
> >>>>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
> >>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>>>> ---
> >>>>>>  fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++---
> >>>>>>  fs/iomap/iter.c        |  6 +++
> >>>>>>  include/linux/iomap.h  |  4 ++
> >>>>>>  3 files changed, 94 insertions(+), 5 deletions(-)
> >>>>>>
> >>>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> >>>>>> index 38da2fa6e6b0..194e3cc0857f 100644
> >>>>>> --- a/fs/iomap/buffered-io.c
> >>>>>> +++ b/fs/iomap/buffered-io.c
> >>>> [...]
> >>>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> >>>>>>  	return status;
> >>>>>>  }
> >>>>>>  
> >>>>>> +loff_t
> >>>>>> +iomap_fill_dirty_folios(
> >>>>>> +	struct iomap_iter	*iter,
> >>>>>> +	loff_t			offset,
> >>>>>> +	loff_t			length)
> >>>>>> +{
> >>>>>> +	struct address_space	*mapping = iter->inode->i_mapping;
> >>>>>> +	pgoff_t			start = offset >> PAGE_SHIFT;
> >>>>>> +	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;
> >>>>>> +
> >>>>>> +	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
> >>>>>> +	if (!iter->fbatch)
> >>>>
> >>>> Hi, Brian!
> >>>>
> >>>> I think ext4 needs to be aware of this failure after it converts to use
> >>>> iomap infrastructure. It is because if we fail to add dirty folios to the
> >>>> fbatch, iomap_zero_range() will flush those unwritten and dirty range.
> >>>> This could potentially lead to a deadlock, as most calls to
> >>>> ext4_block_zero_page_range() occur under an active journal handle.
> >>>> Writeback operations under an active journal handle may result in circular
> >>>> waiting within journal transactions. So please return this error code, and
> >>>> then ext4 can interrupt zero operations to prevent deadlock.
> >>>>
> >>>
> >>> Hi Yi,
> >>>
> >>> Thanks for looking at this.
> >>>
> >>> Huh.. so the reason for falling back like this here is just that this
> >>> was considered an optional optimization, with the flush in
> >>> iomap_zero_range() being default fallback behavior. IIUC, what you're
> >>> saying means that the current zero range behavior without this series is
> >>> problematic for ext4-on-iomap..? 
> >>
> >> Yes.
> >>
> >>> If so, have you observed issues you can share details about?
> >>
> >> Sure.
> >>
> >> Before delving into the specific details of this issue, I would like
> >> to provide some background information on the rule that ext4 cannot
> >> wait for writeback in an active journal handle. If you are aware of
> >> this background, please skip this paragraph. During ext4 writing back
> >> the page cache, it may start a new journal handle to allocate blocks,
> >> update the disksize, and convert unwritten extents after the I/O is
> >> completed. When starting this new journal handle, if the current
> >> running journal transaction is in the process of being submitted or
> >> if the journal space is insufficient, it must wait for the ongoing
> >> transaction to be completed, but the prerequisite for this is that all
> >> currently running handles must be terminated. However, if we flush the
> >> page cache under an active journal handle, we cannot stop it, which
> >> may lead to a deadlock.
> >>
> > 
> > Ok, makes sense.
> > 
> >> Now, the issue I have observed occurs when I attempt to use
> >> iomap_zero_range() within ext4_block_zero_page_range(). My current
> >> implementation are below(based on the latest fs-next).
> >>
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index 28547663e4fd..1a21667f3f7c 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> >>  	return 0;
> >>  }
> >>
> >> +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset,
> >> +			loff_t length, unsigned int flags, struct iomap *iomap,
> >> +			struct iomap *srcmap)
> >> +{
> >> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
> >> +	struct ext4_map_blocks map;
> >> +	u8 blkbits = inode->i_blkbits;
> >> +	int ret;
> >> +
> >> +	ret = ext4_emergency_state(inode->i_sb);
> >> +	if (unlikely(ret))
> >> +		return ret;
> >> +
> >> +	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> >> +		return -EINVAL;
> >> +
> >> +	/* Calculate the first and last logical blocks respectively. */
> >> +	map.m_lblk = offset >> blkbits;
> >> +	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> >> +			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> >> +
> >> +	ret = ext4_map_blocks(NULL, inode, &map, 0);
> >> +	if (ret < 0)
> >> +		return ret;
> >> +
> >> +	/*
> >> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
> >> +	 * this bypasses the flush iomap uses to trigger extent conversion
> >> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
> >> +	 */
> >> +	if ((map.m_flags & EXT4_MAP_UNWRITTEN) &&
> >> +	    map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) {
> >> +		loff_t end;
> >> +
> >> +		end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits,
> >> +					      map.m_len << blkbits);
> >> +		if ((end >> blkbits) < map.m_lblk + map.m_len)
> >> +			map.m_len = (end >> blkbits) - map.m_lblk;
> >> +	}
> >> +
> >> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> >> +	return 0;
> >> +}
> >> +
> >> +const struct iomap_ops ext4_iomap_buffered_zero_ops = {
> >> +	.iomap_begin = ext4_iomap_buffered_zero_begin,
> >> +};
> >>
> >>  const struct iomap_ops ext4_iomap_buffered_write_ops = {
> >>  	.iomap_begin = ext4_iomap_buffered_write_begin,
> >> @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle,
> >>  	return err;
> >>  }
> >>
> >> +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from,
> >> +					loff_t length)
> >> +{
> >> +	WARN_ON_ONCE(!inode_is_locked(inode) &&
> >> +		     !rwsem_is_locked(&inode->i_mapping->invalidate_lock));
> >> +
> >> +	return iomap_zero_range(inode, from, length, NULL,
> >> +				&ext4_iomap_buffered_zero_ops,
> >> +				&ext4_iomap_write_ops, NULL);
> >> +}
> >> +
> >>  /*
> >>   * ext4_block_zero_page_range() zeros out a mapping of length 'length'
> >>   * starting from file offset 'from'.  The range to be zero'd must
> >> @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle,
> >>  	if (IS_DAX(inode)) {
> >>  		return dax_zero_range(inode, from, length, NULL,
> >>  				      &ext4_iomap_ops);
> >> +	} else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
> >> +		return ext4_iomap_zero_range(inode, from, length);
> >>  	}
> >>  	return __ext4_block_zero_page_range(handle, mapping, from, length);
> >>  }
> >>
> >> The problem is most calls to ext4_block_zero_page_range() occur under
> >> an active journal handle, so I can reproduce the deadlock issue easily
> >> without this series.
> >>
> >>>
> >>> FWIW, I think your suggestion is reasonable, but I'm also curious what
> >>> the error handling would look like in ext4. Do you expect to fail
> >>> the higher level operation, for example? Cycle locks and retry, etc.?
> >>
> >> Originally, I wanted ext4_block_zero_page_range() to return a failure
> >> to the higher level operation. However, unfortunately, after my testing
> >> today, I discovered that even though we implement this, this series still
> >> cannot resolve the issue. The corner case is:
> >>
> >> Assume we have a dirty folio that covers both hole and unwritten mappings.
> >>
> >>    |- dirty folio  -|
> >>    [hhhhhhhhuuuuuuuu]                h:hole, u:unwritten
> >>
> >> If we punch the range of the hole, ext4_punch_hole()->
> >> ext4_zero_partial_blocks() will zero out the first half of the dirty folio.
> >> Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio
> >> since the target range is a hole. Finally, iomap_zero_range() will still
> >> flush this whole folio and lead to deadlock during writeback of the latter
> >> half of the folio.
> >>
> > 
> > Hmm.. Ok. So it seems there are at least a couple ways around this
> > particular quirk. I suspect one is that you could just call the fill
> > helper in the hole case as well, but that's kind of a hack and not
> > really intended use.
> > 
> > The other way goes back to the fact that the flush for the hole case was
> > kind of a corner case hack in the first place. The original comment for
> > that seems to have been dropped, but see commit 7d9b474ee4cc ("iomap:
> > make zero range flush conditional on unwritten mappings") for reference
> > to the original intent.
> > 
> > I'd have to go back and investigate if something regresses with that
> > taken out, but my recollection is that was something that needed proper
> > fixing eventually anyways. I'm particularly wondering if that is no
> > longer an issue now that pagecache_isize_extended() handles the post-eof
> > zeroing (the caveat being we might just need to call it in some
> > additional size extension cases besides just setattr/truncate).
> 
> Yeah, I agree with you. I suppose the post-EOF partial folio zeroing in
> pagecache_isize_extended() should work.
> 

Ok..

> > 
> >>>
> >>> The reason I ask is because the folio_batch handling has come up through
> >>> discussions on this series. My position so far has been to keep it as a
> >>> separate allocation and to keep things simple since it is currently
> >>> isolated to zero range, but that may change if the usage spills over to
> >>> other operations (which seems expected at this point). I suspect that if
> >>> a filesystem actually depends on this for correct behavior, that is
> >>> another data point worth considering on that topic.
> >>>
> >>> So that has me wondering if it would be better/easier here to perhaps
> >>> embed the batch in iomap_iter, or maybe as an incremental step put it on
> >>> the stack in iomap_zero_range() and initialize the iomap_iter pointer
> >>> there instead of doing the dynamic allocation (then the fill helper
> >>> would set a flag to indicate the fs did pagecache lookup). Thoughts on
> >>> something like that?
> >>>
> >>> Also IIUC ext4-on-iomap is still a WIP and review on this series seems
> >>> to have mostly wound down. Any objection if the fix for that comes along
> >>> as a followup patch rather than a rework of this series?
> >>
> >> It seems that we don't need to modify this series, we need to consider
> >> other solutions to resolve this deadlock issue.
> >>
> >> In my v1 ext4-on-iomap series [1], I resolved this issue by moving all
> >> instances of ext4_block_zero_page_range() out of the running journal
> >> handle(please see patch 19-21). But I don't think this is a good solution
> >> since it's complex and fragile. Besides, after commit c7fc0366c6562
> >> ("ext4: partial zero eof block on unaligned inode size extension"), you
> >> added more invocations of ext4_zero_partial_blocks(), and the situation
> >> has become more complicated (Although I think the calls in the three
> >> write_end callbacks can be removed).
> >>
> >> Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios
> >> over unwritten mappings before zeroing partial blocks. This is because
> >> ext4 always zeroes the in-memory page cache before zeroing(e.g, in
> >> ext4_setattr() and ext4_punch_hole()), it means if the target range is
> >> still dirty and unwritten when calling ext4_block_zero_page_range(), it
> >> must have already been zeroed. Was I missing something? Therefore, I was
> >> wondering if there are any ways to prevent flushing in
> >> iomap_zero_range()? Any ideas?
> > 
> > It's certainly possible that the quirk fixed by the flush the hole case
> > was never a problem on ext4, if that's what you mean. Most of the
> > testing for this was on XFS since ext4 hadn't used iomap for buffered
> > writes.
> > 
> > At the end of the day, the batch mechanism is intended to facilitate
> > avoiding the flush entirely. I'm still paging things back in here.. but
> > if we had two smallish changes to this code path to 1. eliminate the
> > dynamic folio_batch allocation and 2. drop the flush on hole mapping
> > case, would that address the issues with iomap zero range for ext4?
> > 
> 
> Thank you for looking at this!
> 
> I made a simple modification to the iomap_zero_range() function based
> on the second solution you mentioned, then tested it using kvm-xfstests
> these days. This solution works fine on ext4 and I don't find any other
> risks by now. (Since my testing environment has sufficient memory, I
> have not yet handled the case of memory allocation failure).
> 

Great, thanks for evaluating that. I've been playing around with the
exact same change the past few days. As it turns out, this still breaks
something with XFS. I've narrowed it down to an interaction between a
large eof folio that fails to split on truncate due to being dirty, COW
prealloc and insert range racing with writeback in such a way that this
results in a mapped post-eof block on the file. Technically that isn't
the end of the world so long as it is zeroed, but a subsequent zero
range can warn if the folio is reclaimed and we now end up with a new
one that starts beyond EOF, because those folios don't write back.

I have a mix of hacks that seems to address the generic/363 failure, but
it still needs further testing and analysis to unwind my mess of various
experiments and whatnot. ;P

> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1520,7 +1520,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> 		     srcmap->type == IOMAP_UNWRITTEN)) {
> 			s64 status;
> 
> -			if (range_dirty) {
> +			if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) {
> 				range_dirty = false;
> 				status = iomap_zero_iter_flush_and_stale(&iter);
> 			} else {
> 
> Another thing I want to mention (although there are no real issues at
> the moment, I still want to mention it) is that there appears to be
> no consistency guarantee between the lookup of the mapping and the
> folio_batch. For example, assume we have a file which contains two
> dirty folios and two unwritten extents, where one folio corresponds to one
> extent. We zero out these two folios.
> 
>     | dirty folio 1  || dirty folio 2   |
>     [uuuuuuuuuuuuuuuu][uuuuuuuuuuuuuuuuu]
> 
> In the first call to ->iomap_begin(), we get the unwritten extent 1.
> At the same time, another thread writes back folio 1 and clears this
> folio, so this folio will not be added to the folio_batch. Then
> iomap_zero_range() will still flush those two folios. When flushing
> the second folio, there is still a risk of deadlock due to changes in
> metadata.
> 

Hmm.. not sure I follow the example. The folio batch should include the
folio in any case other than where it can look up, lock it and confirm
it is clean. If the folio is clean and thus not included, the iomap
logic should still see the empty folio batch and skip over the mapping
if unwritten. (I want to replace this with a flag to address your memory
allocation concern, but that is orthogonal to this logic.)

Of course this should happen under appropriate fs locks such that iomap
either sees the folio in dirty/writeback state where the mapping is
unwritten, or if the folio has been cleaned, the mapping is reported as
written.
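
For illustration, here is a minimal sketch of the decision I'm describing
(iomap_zero_can_skip() is a made-up name for this reply, not something in
the series; it just reuses the iter->fbatch field posted above and assumes
the mapping and batch were looked up under the same fs lock):

static bool iomap_zero_can_skip(const struct iomap_iter *iter,
				const struct iomap *srcmap)
{
	/* no batch provided by the fs: fall back to the flush heuristic */
	if (!iter->fbatch)
		return false;

	/*
	 * The fs looked up this mapping's range and found nothing dirty,
	 * so an unwritten mapping has no pagecache in need of zeroing and
	 * the range can simply be advanced over.
	 */
	return srcmap->type == IOMAP_UNWRITTEN &&
	       !folio_batch_count(iter->fbatch);
}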

If I'm still missing something with your example above, can you
elaborate a bit further? Thanks.

Brian

> However, since ext4 currently uses this interface only to zero out
> partial blocks, this situation will not happen, but if the usage
> changes in the future, we should be very careful about this point.
> So in the future, I hope to have a more reliable method to avoid
> flushing in iomap_zero_range().
> 
> Therefore, at the moment, I think that solving the problem through
> these two points is feasible (I hope I haven't missed anything :-) ),
> though it is somewhat fragile. What do other ext4 developers think?
> 
> Thanks,
> Yi.
> 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing
  2025-08-05 13:08               ` Brian Foster
@ 2025-08-06  3:10                 ` Zhang Yi
  2025-08-06 13:25                   ` Brian Foster
  0 siblings, 1 reply; 37+ messages in thread
From: Zhang Yi @ 2025-08-06  3:10 UTC (permalink / raw)
  To: Brian Foster
  Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong,
	Ext4 Developers List

On 2025/8/5 21:08, Brian Foster wrote:
> On Sat, Aug 02, 2025 at 03:19:54PM +0800, Zhang Yi wrote:
>> On 2025/7/30 21:17, Brian Foster wrote:
>>> On Sat, Jul 19, 2025 at 07:07:43PM +0800, Zhang Yi wrote:
>>>> On 2025/7/18 21:48, Brian Foster wrote:
>>>>> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote:
>>>>>> On 2025/7/15 13:22, Darrick J. Wong wrote:
>>>>>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote:
>>>>>>>> The only way zero range can currently process unwritten mappings
>>>>>>>> with dirty pagecache is to check whether the range is dirty before
>>>>>>>> mapping lookup and then flush when at least one underlying mapping
>>>>>>>> is unwritten. This ordering is required to prevent iomap lookup from
>>>>>>>> racing with folio writeback and reclaim.
>>>>>>>>
>>>>>>>> Since zero range can skip ranges of unwritten mappings that are
>>>>>>>> clean in cache, this operation can be improved by allowing the
>>>>>>>> filesystem to provide a set of dirty folios that require zeroing. In
>>>>>>>> turn, rather than flush or iterate file offsets, zero range can
>>>>>>>> iterate on folios in the batch and advance over clean or uncached
>>>>>>>> ranges in between.
>>>>>>>>
>>>>>>>> Add a folio_batch in struct iomap and provide a helper for fs' to
>>>>>>>
>>>>>>> /me confused by the single quote; is this supposed to read:
>>>>>>>
>>>>>>> "...for the fs to populate..."?
>>>>>>>
>>>>>>> Either way the code changes look like a reasonable thing to do for the
>>>>>>> pagecache (try to grab a bunch of dirty folios while XFS holds the
>>>>>>> mapping lock) so
>>>>>>>
>>>>>>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
>>>>>>>
>>>>>>> --D
>>>>>>>
>>>>>>>
>>>>>>>> populate the batch at lookup time. Update the folio lookup path to
>>>>>>>> return the next folio in the batch, if provided, and advance the
>>>>>>>> iter if the folio starts beyond the current offset.
>>>>>>>>
>>>>>>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>>>>> ---
>>>>>>>>  fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++---
>>>>>>>>  fs/iomap/iter.c        |  6 +++
>>>>>>>>  include/linux/iomap.h  |  4 ++
>>>>>>>>  3 files changed, 94 insertions(+), 5 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>>>>>>>> index 38da2fa6e6b0..194e3cc0857f 100644
>>>>>>>> --- a/fs/iomap/buffered-io.c
>>>>>>>> +++ b/fs/iomap/buffered-io.c
>>>>>> [...]
>>>>>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>>>>>>>>  	return status;
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> +loff_t
>>>>>>>> +iomap_fill_dirty_folios(
>>>>>>>> +	struct iomap_iter	*iter,
>>>>>>>> +	loff_t			offset,
>>>>>>>> +	loff_t			length)
>>>>>>>> +{
>>>>>>>> +	struct address_space	*mapping = iter->inode->i_mapping;
>>>>>>>> +	pgoff_t			start = offset >> PAGE_SHIFT;
>>>>>>>> +	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;
>>>>>>>> +
>>>>>>>> +	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
>>>>>>>> +	if (!iter->fbatch)
>>>>>>
>>>>>> Hi, Brian!
>>>>>>
>>>>>> I think ext4 needs to be aware of this failure after it converts to use
>>>>>> iomap infrastructure. It is because if we fail to add dirty folios to the
>>>>>> fbatch, iomap_zero_range() will flush those unwritten and dirty range.
>>>>>> This could potentially lead to a deadlock, as most calls to
>>>>>> ext4_block_zero_page_range() occur under an active journal handle.
>>>>>> Writeback operations under an active journal handle may result in circular
>>>>>> waiting within journal transactions. So please return this error code, and
>>>>>> then ext4 can interrupt zero operations to prevent deadlock.
>>>>>>
>>>>>
>>>>> Hi Yi,
>>>>>
>>>>> Thanks for looking at this.
>>>>>
>>>>> Huh.. so the reason for falling back like this here is just that this
>>>>> was considered an optional optimization, with the flush in
>>>>> iomap_zero_range() being default fallback behavior. IIUC, what you're
>>>>> saying means that the current zero range behavior without this series is
>>>>> problematic for ext4-on-iomap..? 
>>>>
>>>> Yes.
>>>>
>>>>> If so, have you observed issues you can share details about?
>>>>
>>>> Sure.
>>>>
>>>> Before delving into the specific details of this issue, I would like
>>>> to provide some background information on the rule that ext4 cannot
>>>> wait for writeback in an active journal handle. If you are aware of
>>>> this background, please skip this paragraph. During ext4 writing back
>>>> the page cache, it may start a new journal handle to allocate blocks,
>>>> update the disksize, and convert unwritten extents after the I/O is
>>>> completed. When starting this new journal handle, if the current
>>>> running journal transaction is in the process of being submitted or
>>>> if the journal space is insufficient, it must wait for the ongoing
>>>> transaction to be completed, but the prerequisite for this is that all
>>>> currently running handles must be terminated. However, if we flush the
>>>> page cache under an active journal handle, we cannot stop it, which
>>>> may lead to a deadlock.
>>>>
>>>
>>> Ok, makes sense.
>>>
>>>> Now, the issue I have observed occurs when I attempt to use
>>>> iomap_zero_range() within ext4_block_zero_page_range(). My current
>>>> implementation are below(based on the latest fs-next).
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index 28547663e4fd..1a21667f3f7c 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>>>>  	return 0;
>>>>  }
>>>>
>>>> +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset,
>>>> +			loff_t length, unsigned int flags, struct iomap *iomap,
>>>> +			struct iomap *srcmap)
>>>> +{
>>>> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
>>>> +	struct ext4_map_blocks map;
>>>> +	u8 blkbits = inode->i_blkbits;
>>>> +	int ret;
>>>> +
>>>> +	ret = ext4_emergency_state(inode->i_sb);
>>>> +	if (unlikely(ret))
>>>> +		return ret;
>>>> +
>>>> +	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
>>>> +		return -EINVAL;
>>>> +
>>>> +	/* Calculate the first and last logical blocks respectively. */
>>>> +	map.m_lblk = offset >> blkbits;
>>>> +	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>>>> +			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
>>>> +
>>>> +	ret = ext4_map_blocks(NULL, inode, &map, 0);
>>>> +	if (ret < 0)
>>>> +		return ret;
>>>> +
>>>> +	/*
>>>> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
>>>> +	 * this bypasses the flush iomap uses to trigger extent conversion
>>>> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
>>>> +	 */
>>>> +	if ((map.m_flags & EXT4_MAP_UNWRITTEN) &&
>>>> +	    map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) {
>>>> +		loff_t end;
>>>> +
>>>> +		end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits,
>>>> +					      map.m_len << blkbits);
>>>> +		if ((end >> blkbits) < map.m_lblk + map.m_len)
>>>> +			map.m_len = (end >> blkbits) - map.m_lblk;
>>>> +	}
>>>> +
>>>> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +const struct iomap_ops ext4_iomap_buffered_zero_ops = {
>>>> +	.iomap_begin = ext4_iomap_buffered_zero_begin,
>>>> +};
>>>>
>>>>  const struct iomap_ops ext4_iomap_buffered_write_ops = {
>>>>  	.iomap_begin = ext4_iomap_buffered_write_begin,
>>>> @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>>>>  	return err;
>>>>  }
>>>>
>>>> +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from,
>>>> +					loff_t length)
>>>> +{
>>>> +	WARN_ON_ONCE(!inode_is_locked(inode) &&
>>>> +		     !rwsem_is_locked(&inode->i_mapping->invalidate_lock));
>>>> +
>>>> +	return iomap_zero_range(inode, from, length, NULL,
>>>> +				&ext4_iomap_buffered_zero_ops,
>>>> +				&ext4_iomap_write_ops, NULL);
>>>> +}
>>>> +
>>>>  /*
>>>>   * ext4_block_zero_page_range() zeros out a mapping of length 'length'
>>>>   * starting from file offset 'from'.  The range to be zero'd must
>>>> @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle,
>>>>  	if (IS_DAX(inode)) {
>>>>  		return dax_zero_range(inode, from, length, NULL,
>>>>  				      &ext4_iomap_ops);
>>>> +	} else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
>>>> +		return ext4_iomap_zero_range(inode, from, length);
>>>>  	}
>>>>  	return __ext4_block_zero_page_range(handle, mapping, from, length);
>>>>  }
>>>>
>>>> The problem is most calls to ext4_block_zero_page_range() occur under
>>>> an active journal handle, so I can reproduce the deadlock issue easily
>>>> without this series.
>>>>
>>>>>
>>>>> FWIW, I think your suggestion is reasonable, but I'm also curious what
> >>>>> the error handling would look like in ext4. Do you expect to fail
>>>>> the higher level operation, for example? Cycle locks and retry, etc.?
>>>>
>>>> Originally, I wanted ext4_block_zero_page_range() to return a failure
>>>> to the higher level operation. However, unfortunately, after my testing
>>>> today, I discovered that even though we implement this, this series still
>>>> cannot resolve the issue. The corner case is:
>>>>
> >>>> Assume we have a dirty folio that covers both hole and unwritten mappings.
>>>>
>>>>    |- dirty folio  -|
> >>>>    [hhhhhhhhuuuuuuuu]                h:hole, u:unwritten
>>>>
>>>> If we punch the range of the hole, ext4_punch_hole()->
>>>> ext4_zero_partial_blocks() will zero out the first half of the dirty folio.
>>>> Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio
>>>> since the target range is a hole. Finally, iomap_zero_range() will still
> >>>> flush this whole folio and lead to deadlock during writeback of the latter
>>>> half of the folio.
>>>>
>>>
>>> Hmm.. Ok. So it seems there are at least a couple ways around this
>>> particular quirk. I suspect one is that you could just call the fill
>>> helper in the hole case as well, but that's kind of a hack and not
>>> really intended use.
>>>
>>> The other way goes back to the fact that the flush for the hole case was
>>> kind of a corner case hack in the first place. The original comment for
>>> that seems to have been dropped, but see commit 7d9b474ee4cc ("iomap:
>>> make zero range flush conditional on unwritten mappings") for reference
>>> to the original intent.
>>>
>>> I'd have to go back and investigate if something regresses with that
>>> taken out, but my recollection is that was something that needed proper
>>> fixing eventually anyways. I'm particularly wondering if that is no
>>> longer an issue now that pagecache_isize_extended() handles the post-eof
>>> zeroing (the caveat being we might just need to call it in some
>>> additional size extension cases besides just setattr/truncate).
>>
>> Yeah, I agree with you. I suppose the post-EOF partial folio zeroing in
>> pagecache_isize_extended() should work.
>>
> 
> Ok..
> 
>>>
>>>>>
>>>>> The reason I ask is because the folio_batch handling has come up through
>>>>> discussions on this series. My position so far has been to keep it as a
>>>>> separate allocation and to keep things simple since it is currently
>>>>> isolated to zero range, but that may change if the usage spills over to
>>>>> other operations (which seems expected at this point). I suspect that if
>>>>> a filesystem actually depends on this for correct behavior, that is
>>>>> another data point worth considering on that topic.
>>>>>
>>>>> So that has me wondering if it would be better/easier here to perhaps
>>>>> embed the batch in iomap_iter, or maybe as an incremental step put it on
>>>>> the stack in iomap_zero_range() and initialize the iomap_iter pointer
>>>>> there instead of doing the dynamic allocation (then the fill helper
>>>>> would set a flag to indicate the fs did pagecache lookup). Thoughts on
>>>>> something like that?
>>>>>
>>>>> Also IIUC ext4-on-iomap is still a WIP and review on this series seems
>>>>> to have mostly wound down. Any objection if the fix for that comes along
>>>>> as a followup patch rather than a rework of this series?
>>>>
>>>> It seems that we don't need to modify this series, we need to consider
>>>> other solutions to resolve this deadlock issue.
>>>>
>>>> In my v1 ext4-on-iomap series [1], I resolved this issue by moving all
>>>> instances of ext4_block_zero_page_range() out of the running journal
>>>> handle(please see patch 19-21). But I don't think this is a good solution
>>>> since it's complex and fragile. Besides, after commit c7fc0366c6562
>>>> ("ext4: partial zero eof block on unaligned inode size extension"), you
>>>> added more invocations of ext4_zero_partial_blocks(), and the situation
> >>>> has become more complicated (Although I think the calls in the three
>>>> write_end callbacks can be removed).
>>>>
>>>> Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios
>>>> over unwritten mappings before zeroing partial blocks. This is because
>>>> ext4 always zeroes the in-memory page cache before zeroing(e.g, in
>>>> ext4_setattr() and ext4_punch_hole()), it means if the target range is
>>>> still dirty and unwritten when calling ext4_block_zero_page_range(), it
> >>>> must have already been zeroed. Was I missing something? Therefore, I was
>>>> wondering if there are any ways to prevent flushing in
>>>> iomap_zero_range()? Any ideas?
>>>
>>> It's certainly possible that the quirk fixed by the flush the hole case
>>> was never a problem on ext4, if that's what you mean. Most of the
>>> testing for this was on XFS since ext4 hadn't used iomap for buffered
>>> writes.
>>>
>>> At the end of the day, the batch mechanism is intended to facilitate
>>> avoiding the flush entirely. I'm still paging things back in here.. but
>>> if we had two smallish changes to this code path to 1. eliminate the
>>> dynamic folio_batch allocation and 2. drop the flush on hole mapping
>>> case, would that address the issues with iomap zero range for ext4?
>>>
>>
>> Thank you for looking at this!
>>
>> I made a simple modification to the iomap_zero_range() function based
>> on the second solution you mentioned, then tested it using kvm-xfstests
>> these days. This solution works fine on ext4 and I don't find any other
>> risks by now. (Since my testing environment has sufficient memory, I
>> have not yet handled the case of memory allocation failure).
>>
> 
> Great, thanks for evaluating that. I've been playing around with the
> exact same change the past few days. As it turns out, this still breaks
> something with XFS. I've narrowed it down to an interaction between a
> large eof folio that fails to split on truncate due to being dirty, COW
> prealloc and insert range racing with writeback in such a way that this
> results in a mapped post-eof block on the file. Technically that isn't
> the end of the world so long as it is zeroed, but a subsequent zero
> range can warn if the folio is reclaimed and we now end up with a new
> one that starts beyond EOF, because those folios don't write back.
> 
> I have a mix of hacks that seems to address the generic/363 failure, but
> it still needs further testing and analysis to unwind my mess of various
> experiments and whatnot. ;P

OK, thanks for debugging and analyzing this.

> 
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -1520,7 +1520,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>> 		     srcmap->type == IOMAP_UNWRITTEN)) {
>> 			s64 status;
>>
>> -			if (range_dirty) {
>> +			if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) {
>> 				range_dirty = false;
>> 				status = iomap_zero_iter_flush_and_stale(&iter);
>> 			} else {
>>
>> Another thing I want to mention (although there are no real issues at
>> the moment, I still want to mention it) is that there appears to be
>> no consistency guarantee between the lookup of the mapping and the
>> folio_batch. For example, assume we have a file which contains two
>> dirty folios and two unwritten extents, where one folio corresponds to one
>> extent. We zero out these two folios.
>>
>>     | dirty folio 1  || dirty folio 2   |
>>     [uuuuuuuuuuuuuuuu][uuuuuuuuuuuuuuuuu]
>>
>> In the first call to ->iomap_begin(), we get the unwritten extent 1.
>> At the same time, another thread writes back folio 1 and clears this
>> folio, so this folio will not be added to the folio_batch. Then
>> iomap_zero_range() will still flush those two folios. When flushing
>> the second folio, there is still a risk of deadlock due to changes in
>> metadata.
>>
> 
> Hmm.. not sure I follow the example. The folio batch should include the
> folio in any case other than where it can look up, lock it and confirm
> it is clean. If the folio is clean and thus not included, the iomap
> logic should still see the empty folio batch and skip over the mapping
> if unwritten. (I want to replace this with a flag to address your memory
> allocation concern, but that is orthogonal to this logic.)
> 
> Of course this should happen under appropriate fs locks such that iomap
> either sees the folio in dirty/writeback state where the mapping is
> unwritten, or if the folio has been cleaned, the mapping is reported as
> written.
> 
> If I'm still missing something with your example above, can you
> elaborate a bit further? Thanks.
> 

Sorry for not making things clear. The race condition is the following:

zero range                            sync_file_range
iomap_zero_range() //folio 1+2
 range_dirty = filemap_range_needs_writeback()
 //range_dirty is set to 'true'
 iomap_iter()
   ext4_iomap_buffer_zero_begin()
     ext4_map_blocks()
     //get unwritten extent 1
                                      sync_file_range() //folio 1
                                        iomap_writepages()
                                        ...
                                        iomap_finish_ioend()
                                          folio_end_writeback()
                                          //clear folio 1, and
                                          //extent 1 becomes written
     iomap_fill_dirty_folios()
     //do not add folio 1 to batch
  iomap_zero_iter_flush_and_stale()
  //!fbatch && IOMAP_UNWRITTEN && range_dirty
  //flush folio 1+2, folio 2 is still dirty, then deadlock

Besides, if the range of folio 1 is initially clean and unwritten
(folio 2 is still dirty), the flush can also be triggered without the
concurrent sync_file_range.

The problem is that we check `range_dirty` only once, up front, for the
entire zero range instead of re-checking it on each iteration, and there
is no fs lock that can prevent a concurrent writeback. Perhaps we need
to check `range_dirty` for each iteration?
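
Something like the following rough, untested sketch is what I have in
mind (based on the hunk quoted above; the inner range_dirty check is
replaced by a per-iteration lookup, setting aside for now the cost of the
extra pagecache scan and the lookup-vs-writeback race you mentioned):

			if (filemap_range_needs_writeback(inode->i_mapping,
					iter.pos,
					iter.pos + iomap_length(&iter) - 1)) {
				/* dirty/writeback folios remain in this sub-range */
				status = iomap_zero_iter_flush_and_stale(&iter);
			} else {
				status = iomap_iter_advance_full(&iter);
			}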

Regards,
Yi.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing
  2025-08-06  3:10                 ` Zhang Yi
@ 2025-08-06 13:25                   ` Brian Foster
  2025-08-07  4:58                     ` Zhang Yi
  0 siblings, 1 reply; 37+ messages in thread
From: Brian Foster @ 2025-08-06 13:25 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong,
	Ext4 Developers List

On Wed, Aug 06, 2025 at 11:10:30AM +0800, Zhang Yi wrote:
> On 2025/8/5 21:08, Brian Foster wrote:
> > On Sat, Aug 02, 2025 at 03:19:54PM +0800, Zhang Yi wrote:
> >> On 2025/7/30 21:17, Brian Foster wrote:
> >>> On Sat, Jul 19, 2025 at 07:07:43PM +0800, Zhang Yi wrote:
> >>>> On 2025/7/18 21:48, Brian Foster wrote:
> >>>>> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote:
> >>>>>> On 2025/7/15 13:22, Darrick J. Wong wrote:
> >>>>>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote:
> >>>>>>>> The only way zero range can currently process unwritten mappings
> >>>>>>>> with dirty pagecache is to check whether the range is dirty before
> >>>>>>>> mapping lookup and then flush when at least one underlying mapping
> >>>>>>>> is unwritten. This ordering is required to prevent iomap lookup from
> >>>>>>>> racing with folio writeback and reclaim.
> >>>>>>>>
> >>>>>>>> Since zero range can skip ranges of unwritten mappings that are
> >>>>>>>> clean in cache, this operation can be improved by allowing the
> >>>>>>>> filesystem to provide a set of dirty folios that require zeroing. In
> >>>>>>>> turn, rather than flush or iterate file offsets, zero range can
> >>>>>>>> iterate on folios in the batch and advance over clean or uncached
> >>>>>>>> ranges in between.
> >>>>>>>>
> >>>>>>>> Add a folio_batch in struct iomap and provide a helper for fs' to
> >>>>>>>
> >>>>>>> /me confused by the single quote; is this supposed to read:
> >>>>>>>
> >>>>>>> "...for the fs to populate..."?
> >>>>>>>
> >>>>>>> Either way the code changes look like a reasonable thing to do for the
> >>>>>>> pagecache (try to grab a bunch of dirty folios while XFS holds the
> >>>>>>> mapping lock) so
> >>>>>>>
> >>>>>>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> >>>>>>>
> >>>>>>> --D
> >>>>>>>
> >>>>>>>
> >>>>>>>> populate the batch at lookup time. Update the folio lookup path to
> >>>>>>>> return the next folio in the batch, if provided, and advance the
> >>>>>>>> iter if the folio starts beyond the current offset.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
> >>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>>>>>> ---
> >>>>>>>>  fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++---
> >>>>>>>>  fs/iomap/iter.c        |  6 +++
> >>>>>>>>  include/linux/iomap.h  |  4 ++
> >>>>>>>>  3 files changed, 94 insertions(+), 5 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> >>>>>>>> index 38da2fa6e6b0..194e3cc0857f 100644
> >>>>>>>> --- a/fs/iomap/buffered-io.c
> >>>>>>>> +++ b/fs/iomap/buffered-io.c
> >>>>>> [...]
> >>>>>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> >>>>>>>>  	return status;
> >>>>>>>>  }
> >>>>>>>>  
> >>>>>>>> +loff_t
> >>>>>>>> +iomap_fill_dirty_folios(
> >>>>>>>> +	struct iomap_iter	*iter,
> >>>>>>>> +	loff_t			offset,
> >>>>>>>> +	loff_t			length)
> >>>>>>>> +{
> >>>>>>>> +	struct address_space	*mapping = iter->inode->i_mapping;
> >>>>>>>> +	pgoff_t			start = offset >> PAGE_SHIFT;
> >>>>>>>> +	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;
> >>>>>>>> +
> >>>>>>>> +	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
> >>>>>>>> +	if (!iter->fbatch)
> >>>>>>
> >>>>>> Hi, Brian!
> >>>>>>
> >>>>>> I think ext4 needs to be aware of this failure after it converts to use
> >>>>>> iomap infrastructure. It is because if we fail to add dirty folios to the
> >>>>>> fbatch, iomap_zero_range() will flush those unwritten and dirty range.
> >>>>>> This could potentially lead to a deadlock, as most calls to
> >>>>>> ext4_block_zero_page_range() occur under an active journal handle.
> >>>>>> Writeback operations under an active journal handle may result in circular
> >>>>>> waiting within journal transactions. So please return this error code, and
> >>>>>> then ext4 can interrupt zero operations to prevent deadlock.
> >>>>>>
> >>>>>
> >>>>> Hi Yi,
> >>>>>
> >>>>> Thanks for looking at this.
> >>>>>
> >>>>> Huh.. so the reason for falling back like this here is just that this
> >>>>> was considered an optional optimization, with the flush in
> >>>>> iomap_zero_range() being default fallback behavior. IIUC, what you're
> >>>>> saying means that the current zero range behavior without this series is
> >>>>> problematic for ext4-on-iomap..? 
> >>>>
> >>>> Yes.
> >>>>
> >>>>> If so, have you observed issues you can share details about?
> >>>>
> >>>> Sure.
> >>>>
> >>>> Before delving into the specific details of this issue, I would like
> >>>> to provide some background information on the rule that ext4 cannot
> >>>> wait for writeback in an active journal handle. If you are aware of
> >>>> this background, please skip this paragraph. During ext4 writing back
> >>>> the page cache, it may start a new journal handle to allocate blocks,
> >>>> update the disksize, and convert unwritten extents after the I/O is
> >>>> completed. When starting this new journal handle, if the current
> >>>> running journal transaction is in the process of being submitted or
> >>>> if the journal space is insufficient, it must wait for the ongoing
> >>>> transaction to be completed, but the prerequisite for this is that all
> >>>> currently running handles must be terminated. However, if we flush the
> >>>> page cache under an active journal handle, we cannot stop it, which
> >>>> may lead to a deadlock.
> >>>>
> >>>
> >>> Ok, makes sense.
> >>>
> >>>> Now, the issue I have observed occurs when I attempt to use
> >>>> iomap_zero_range() within ext4_block_zero_page_range(). My current
> >>>> implementation are below(based on the latest fs-next).
> >>>>
> >>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >>>> index 28547663e4fd..1a21667f3f7c 100644
> >>>> --- a/fs/ext4/inode.c
> >>>> +++ b/fs/ext4/inode.c
> >>>> @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> >>>>  	return 0;
> >>>>  }
> >>>>
> >>>> +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset,
> >>>> +			loff_t length, unsigned int flags, struct iomap *iomap,
> >>>> +			struct iomap *srcmap)
> >>>> +{
> >>>> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
> >>>> +	struct ext4_map_blocks map;
> >>>> +	u8 blkbits = inode->i_blkbits;
> >>>> +	int ret;
> >>>> +
> >>>> +	ret = ext4_emergency_state(inode->i_sb);
> >>>> +	if (unlikely(ret))
> >>>> +		return ret;
> >>>> +
> >>>> +	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> >>>> +		return -EINVAL;
> >>>> +
> >>>> +	/* Calculate the first and last logical blocks respectively. */
> >>>> +	map.m_lblk = offset >> blkbits;
> >>>> +	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> >>>> +			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> >>>> +
> >>>> +	ret = ext4_map_blocks(NULL, inode, &map, 0);
> >>>> +	if (ret < 0)
> >>>> +		return ret;
> >>>> +
> >>>> +	/*
> >>>> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
> >>>> +	 * this bypasses the flush iomap uses to trigger extent conversion
> >>>> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
> >>>> +	 */
> >>>> +	if ((map.m_flags & EXT4_MAP_UNWRITTEN) &&
> >>>> +	    map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) {
> >>>> +		loff_t end;
> >>>> +
> >>>> +		end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits,
> >>>> +					      map.m_len << blkbits);
> >>>> +		if ((end >> blkbits) < map.m_lblk + map.m_len)
> >>>> +			map.m_len = (end >> blkbits) - map.m_lblk;
> >>>> +	}
> >>>> +
> >>>> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> >>>> +	return 0;
> >>>> +}
> >>>> +
> >>>> +const struct iomap_ops ext4_iomap_buffered_zero_ops = {
> >>>> +	.iomap_begin = ext4_iomap_buffered_zero_begin,
> >>>> +};
> >>>>
> >>>>  const struct iomap_ops ext4_iomap_buffered_write_ops = {
> >>>>  	.iomap_begin = ext4_iomap_buffered_write_begin,
> >>>> @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle,
> >>>>  	return err;
> >>>>  }
> >>>>
> >>>> +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from,
> >>>> +					loff_t length)
> >>>> +{
> >>>> +	WARN_ON_ONCE(!inode_is_locked(inode) &&
> >>>> +		     !rwsem_is_locked(&inode->i_mapping->invalidate_lock));
> >>>> +
> >>>> +	return iomap_zero_range(inode, from, length, NULL,
> >>>> +				&ext4_iomap_buffered_zero_ops,
> >>>> +				&ext4_iomap_write_ops, NULL);
> >>>> +}
> >>>> +
> >>>>  /*
> >>>>   * ext4_block_zero_page_range() zeros out a mapping of length 'length'
> >>>>   * starting from file offset 'from'.  The range to be zero'd must
> >>>> @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle,
> >>>>  	if (IS_DAX(inode)) {
> >>>>  		return dax_zero_range(inode, from, length, NULL,
> >>>>  				      &ext4_iomap_ops);
> >>>> +	} else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
> >>>> +		return ext4_iomap_zero_range(inode, from, length);
> >>>>  	}
> >>>>  	return __ext4_block_zero_page_range(handle, mapping, from, length);
> >>>>  }
> >>>>
> >>>> The problem is most calls to ext4_block_zero_page_range() occur under
> >>>> an active journal handle, so I can reproduce the deadlock issue easily
> >>>> without this series.
> >>>>
> >>>>>
> >>>>> FWIW, I think your suggestion is reasonable, but I'm also curious what
> >>>>> the error handling would look like in ext4. Do you expect to fail
> >>>>> the higher level operation, for example? Cycle locks and retry, etc.?
> >>>>
> >>>> Originally, I wanted ext4_block_zero_page_range() to return a failure
> >>>> to the higher level operation. However, unfortunately, after my testing
> >>>> today, I discovered that even though we implement this, this series still
> >>>> cannot resolve the issue. The corner case is:
> >>>>
> >>>> Assume we have a dirty folio that covers both hole and unwritten mappings.
> >>>>
> >>>>    |- dirty folio  -|
> >>>>    [hhhhhhhhuuuuuuuu]                h:hole, u:unwritten
> >>>>
> >>>> If we punch the range of the hole, ext4_punch_hole()->
> >>>> ext4_zero_partial_blocks() will zero out the first half of the dirty folio.
> >>>> Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio
> >>>> since the target range is a hole. Finally, iomap_zero_range() will still
> >>>> flush this whole folio and lead to deadlock during writeback of the latter
> >>>> half of the folio.
> >>>>
> >>>
> >>> Hmm.. Ok. So it seems there are at least a couple ways around this
> >>> particular quirk. I suspect one is that you could just call the fill
> >>> helper in the hole case as well, but that's kind of a hack and not
> >>> really intended use.
> >>>
> >>> The other way goes back to the fact that the flush for the hole case was
> >>> kind of a corner case hack in the first place. The original comment for
> >>> that seems to have been dropped, but see commit 7d9b474ee4cc ("iomap:
> >>> make zero range flush conditional on unwritten mappings") for reference
> >>> to the original intent.
> >>>
> >>> I'd have to go back and investigate if something regresses with that
> >>> taken out, but my recollection is that was something that needed proper
> >>> fixing eventually anyways. I'm particularly wondering if that is no
> >>> longer an issue now that pagecache_isize_extended() handles the post-eof
> >>> zeroing (the caveat being we might just need to call it in some
> >>> additional size extension cases besides just setattr/truncate).
> >>
> >> Yeah, I agree with you. I suppose the post-EOF partial folio zeroing in
> >> pagecache_isize_extended() should work.
> >>
> > 
> > Ok..
> > 
> >>>
> >>>>>
> >>>>> The reason I ask is because the folio_batch handling has come up through
> >>>>> discussions on this series. My position so far has been to keep it as a
> >>>>> separate allocation and to keep things simple since it is currently
> >>>>> isolated to zero range, but that may change if the usage spills over to
> >>>>> other operations (which seems expected at this point). I suspect that if
> >>>>> a filesystem actually depends on this for correct behavior, that is
> >>>>> another data point worth considering on that topic.
> >>>>>
> >>>>> So that has me wondering if it would be better/easier here to perhaps
> >>>>> embed the batch in iomap_iter, or maybe as an incremental step put it on
> >>>>> the stack in iomap_zero_range() and initialize the iomap_iter pointer
> >>>>> there instead of doing the dynamic allocation (then the fill helper
> >>>>> would set a flag to indicate the fs did pagecache lookup). Thoughts on
> >>>>> something like that?
> >>>>>
> >>>>> Also IIUC ext4-on-iomap is still a WIP and review on this series seems
> >>>>> to have mostly wound down. Any objection if the fix for that comes along
> >>>>> as a followup patch rather than a rework of this series?
> >>>>
> >>>> It seems that we don't need to modify this series, we need to consider
> >>>> other solutions to resolve this deadlock issue.
> >>>>
> >>>> In my v1 ext4-on-iomap series [1], I resolved this issue by moving all
> >>>> instances of ext4_block_zero_page_range() out of the running journal
> >>>> handle(please see patch 19-21). But I don't think this is a good solution
> >>>> since it's complex and fragile. Besides, after commit c7fc0366c6562
> >>>> ("ext4: partial zero eof block on unaligned inode size extension"), you
> >>>> added more invocations of ext4_zero_partial_blocks(), and the situation
> >>>> has become more complicated (Although I think the calls in the three
> >>>> write_end callbacks can be removed).
> >>>>
> >>>> Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios
> >>>> over unwritten mappings before zeroing partial blocks. This is because
> >>>> ext4 always zeroes the in-memory page cache before zeroing(e.g, in
> >>>> ext4_setattr() and ext4_punch_hole()), it means if the target range is
> >>>> still dirty and unwritten when calling ext4_block_zero_page_range(), it
> >>>> must have already been zeroed. Was I missing something? Therefore, I was
> >>>> wondering if there are any ways to prevent flushing in
> >>>> iomap_zero_range()? Any ideas?
> >>>
> >>> It's certainly possible that the quirk fixed by the flush the hole case
> >>> was never a problem on ext4, if that's what you mean. Most of the
> >>> testing for this was on XFS since ext4 hadn't used iomap for buffered
> >>> writes.
> >>>
> >>> At the end of the day, the batch mechanism is intended to facilitate
> >>> avoiding the flush entirely. I'm still paging things back in here.. but
> >>> if we had two smallish changes to this code path to 1. eliminate the
> >>> dynamic folio_batch allocation and 2. drop the flush on hole mapping
> >>> case, would that address the issues with iomap zero range for ext4?
> >>>
> >>
> >> Thank you for looking at this!
> >>
> >> I made a simple modification to the iomap_zero_range() function based
> >> on the second solution you mentioned, then tested it using kvm-xfstests
> >> these days. This solution works fine on ext4 and I haven't found any
> >> other risks so far. (Since my testing environment has sufficient memory, I
> >> have not yet handled the case of memory allocation failure).
> >>
> > 
> > Great, thanks for evaluating that. I've been playing around with the
> > exact same change the past few days. As it turns out, this still breaks
> > something with XFS. I've narrowed it down to an interaction between a
> > large eof folio that fails to split on truncate due to being dirty, COW
> > prealloc and insert range racing with writeback in such a way that this
> > results in a mapped post-eof block on the file. Technically that isn't
> > the end of the world so long as it is zeroed, but a subsequent zero
> > range can warn if the folio is reclaimed and we now end up with a new
> > one that starts beyond EOF, because those folios don't write back.
> > 
> > I have a mix of hacks that seems to address the generic/363 failure, but
> > it still needs further testing and analysis to unwind my mess of various
> > experiments and whatnot. ;P
> 
> OK, thanks for debugging and analyzing this.
> 
> > 
> >> --- a/fs/iomap/buffered-io.c
> >> +++ b/fs/iomap/buffered-io.c
> >> @@ -1520,7 +1520,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> >> 		     srcmap->type == IOMAP_UNWRITTEN)) {
> >> 			s64 status;
> >>
> >> -			if (range_dirty) {
> >> +			if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) {
> >> 				range_dirty = false;
> >> 				status = iomap_zero_iter_flush_and_stale(&iter);
> >> 			} else {
> >>
> >> Another thing I want to mention (although there are no real issues at
> >> the moment, I still want to mention it) is that there appears to be
> >> no consistency guarantee between the lookup of the mapping and the
> >> folio_batch. For example, assume we have a file which contains two
> >> dirty folios and two unwritten extents, where each folio corresponds
> >> to one extent. We zero out these two folios.
> >>
> >>     | dirty folio 1  || dirty folio 2   |
> >>     [uuuuuuuuuuuuuuuu][uuuuuuuuuuuuuuuuu]
> >>
> >> In the first call to ->iomap_begin(), we get the unwritten extent 1.
> >> At the same time, another thread writes back folio 1 and clears this
> >> folio, so this folio will not be added to the folio_batch. Then
> >> iomap_zero_range() will still flush those two folios. When flushing
> >> the second folio, there is still a risk of deadlock due to changes in
> >> metadata.
> >>
> > 
> > Hmm.. not sure I follow the example. The folio batch should include the
> > folio in any case other than where it can look it up, lock it and confirm
> > it is clean. If the folio is clean and thus not included, the iomap
> > logic should still see the empty folio batch and skip over the mapping
> > if unwritten. (I want to replace this with a flag to address your memory
> > allocation concern, but that is orthogonal to this logic.)
> > 
> > Of course this should happen under appropriate fs locks such that iomap
> > either sees the folio in dirty/writeback state where the mapping is
> > unwritten, or if the folio has been cleaned, the mapping is reported as
> > written.
> > 
> > If I'm still missing something with your example above, can you
> > elaborate a bit further? Thanks.
> > 
> 
> Sorry for not making things clear. The race condition is the following:
> 
> zero range                            sync_file_range
> iomap_zero_range() //folio 1+2
>  range_dirty = filemap_range_needs_writeback()
>  //range_dirty is set to 'true'
>  iomap_iter()
>    ext4_iomap_buffer_zero_begin()
>      ext4_map_blocks()
>      //get unwritten extent 1
>                                       sync_file_range() //folio 1
>                                         iomap_writepages()
>                                         ...
>                                         iomap_finish_ioend()
>                                           folio_end_writeback()
>                                           //clear folio 1, and
>                                           //extent 1 becomes written
>      iomap_fill_dirty_folios()
>      //do not add folio 1 to batch

I think the issue here is that this needs serialization between the
lookups and extent conversion. I.e., XFS handles this by doing the folio
lookup under the same locks used for looking up the extent. So in this
scenario, that ensures that the only way we don't add the folio to the
batch is if the mapping has already been converted by writeback
completion..
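
To make that concrete, the shape of it is roughly the following. This is
only a sketch of the ordering, not the actual XFS patch:
zero_range_iomap_begin() is a made-up name standing in for the fs
->iomap_begin callback, and fs_lock_extent_tree()/fs_unlock_extent_tree()
stand in for whatever lock the filesystem uses to serialize against
extent conversion (the ILOCK in XFS). The only real interface assumed is
the iomap_fill_dirty_folios() helper added by this series.

/*
 * Sketch only: the pagecache lookup happens under the same lock that
 * writeback completion takes to convert unwritten extents, so an empty
 * batch really does mean "no dirty folios over this mapping".
 */
static int zero_range_iomap_begin(struct inode *inode, loff_t offset,
		loff_t length, unsigned int flags, struct iomap *iomap,
		struct iomap *srcmap)
{
	struct iomap_iter *iter =
		container_of(iomap, struct iomap_iter, iomap);
	loff_t end;

	fs_lock_extent_tree(inode);	/* also held by extent conversion */

	/* ... extent lookup for [offset, offset + length) goes here ... */

	/*
	 * For an unwritten mapping within EOF, fill the batch while the
	 * lock is still held and trim the mapping to the returned offset,
	 * as in the ext4 prototype quoted above.
	 */
	end = iomap_fill_dirty_folios(iter, offset, length);
	if (end < offset + length)
		length = end - offset;

	/* ... fill the iomap from the (trimmed) mapping ... */

	fs_unlock_extent_tree(inode);
	return 0;
}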

>   iomap_zero_iter_flush_and_stale()
>   //!fbatch && IOMAP_UNWRITTEN && range_dirty
>   //flush folio 1+2, folio 2 is still dirty, then deadlock
> 

I also think the failure characteristic here is different. In this case
you'd see an empty fbatch because at least on the lookup side the
mapping is presumed to be unwritten. So that means we still wouldn't
flush, but the zeroing would probably race with writeback such that it
skips zeroing the blocks when it shouldn't.

> Besides, if the range of folio 1 is initially clean and unwritten
> (folio 2 is still dirty), the flush can also be triggered without the
> concurrent sync_file_range.
> 
> The problem is that we initially check `range_dirty` only once, and
> that check covers the entire zero range instead of being redone on
> each iteration, and there is no fs lock that can prevent a concurrent
> writeback. Perhaps we need to check 'range_dirty' for each iteration?
> 

I don't think this needs to prevent writeback, but is there not any
locking to protect extent lookups vs. extent conversion (via writeback)?

FWIW, I think it's a little hard to reason about some of these
interactions because of the pending tweaks that aren't implemented yet.
What I'd like to do is get another version of this posted (mostly as
is), hopefully get it landed into -next or whatever, then get another
series going with this handful of tweaks we're discussing to prepare for
ext4 on iomap. From there we can work out any remaining details on
things that might need to change on either side. Sound Ok?

Brian

> Regards,
> Yi.
> 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing
  2025-08-06 13:25                   ` Brian Foster
@ 2025-08-07  4:58                     ` Zhang Yi
  0 siblings, 0 replies; 37+ messages in thread
From: Zhang Yi @ 2025-08-07  4:58 UTC (permalink / raw)
  To: Brian Foster
  Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong,
	Ext4 Developers List

On 2025/8/6 21:25, Brian Foster wrote:
> On Wed, Aug 06, 2025 at 11:10:30AM +0800, Zhang Yi wrote:
>> On 2025/8/5 21:08, Brian Foster wrote:
>>> On Sat, Aug 02, 2025 at 03:19:54PM +0800, Zhang Yi wrote:
>>>> On 2025/7/30 21:17, Brian Foster wrote:
>>>>> On Sat, Jul 19, 2025 at 07:07:43PM +0800, Zhang Yi wrote:
>>>>>> On 2025/7/18 21:48, Brian Foster wrote:
>>>>>>> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote:
>>>>>>>> On 2025/7/15 13:22, Darrick J. Wong wrote:
>>>>>>>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote:
>>>>>>>>>> The only way zero range can currently process unwritten mappings
>>>>>>>>>> with dirty pagecache is to check whether the range is dirty before
>>>>>>>>>> mapping lookup and then flush when at least one underlying mapping
>>>>>>>>>> is unwritten. This ordering is required to prevent iomap lookup from
>>>>>>>>>> racing with folio writeback and reclaim.
>>>>>>>>>>
>>>>>>>>>> Since zero range can skip ranges of unwritten mappings that are
>>>>>>>>>> clean in cache, this operation can be improved by allowing the
>>>>>>>>>> filesystem to provide a set of dirty folios that require zeroing. In
>>>>>>>>>> turn, rather than flush or iterate file offsets, zero range can
>>>>>>>>>> iterate on folios in the batch and advance over clean or uncached
>>>>>>>>>> ranges in between.
>>>>>>>>>>
>>>>>>>>>> Add a folio_batch in struct iomap and provide a helper for fs' to
>>>>>>>>>
>>>>>>>>> /me confused by the single quote; is this supposed to read:
>>>>>>>>>
>>>>>>>>> "...for the fs to populate..."?
>>>>>>>>>
>>>>>>>>> Either way the code changes look like a reasonable thing to do for the
>>>>>>>>> pagecache (try to grab a bunch of dirty folios while XFS holds the
>>>>>>>>> mapping lock) so
>>>>>>>>>
>>>>>>>>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
>>>>>>>>>
>>>>>>>>> --D
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> populate the batch at lookup time. Update the folio lookup path to
>>>>>>>>>> return the next folio in the batch, if provided, and advance the
>>>>>>>>>> iter if the folio starts beyond the current offset.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>>>>>>> ---
>>>>>>>>>>  fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++---
>>>>>>>>>>  fs/iomap/iter.c        |  6 +++
>>>>>>>>>>  include/linux/iomap.h  |  4 ++
>>>>>>>>>>  3 files changed, 94 insertions(+), 5 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>>>>>>>>>> index 38da2fa6e6b0..194e3cc0857f 100644
>>>>>>>>>> --- a/fs/iomap/buffered-io.c
>>>>>>>>>> +++ b/fs/iomap/buffered-io.c
>>>>>>>> [...]
>>>>>>>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>>>>>>>>>>  	return status;
>>>>>>>>>>  }
>>>>>>>>>>  
>>>>>>>>>> +loff_t
>>>>>>>>>> +iomap_fill_dirty_folios(
>>>>>>>>>> +	struct iomap_iter	*iter,
>>>>>>>>>> +	loff_t			offset,
>>>>>>>>>> +	loff_t			length)
>>>>>>>>>> +{
>>>>>>>>>> +	struct address_space	*mapping = iter->inode->i_mapping;
>>>>>>>>>> +	pgoff_t			start = offset >> PAGE_SHIFT;
>>>>>>>>>> +	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;
>>>>>>>>>> +
>>>>>>>>>> +	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
>>>>>>>>>> +	if (!iter->fbatch)
>>>>>>>>
>>>>>>>> Hi, Brian!
>>>>>>>>
>>>>>>>> I think ext4 needs to be aware of this failure after it converts to use
>>>>>>>> iomap infrastructure. It is because if we fail to add dirty folios to the
>>>>>>>> fbatch, iomap_zero_range() will flush those unwritten and dirty ranges.
>>>>>>>> This could potentially lead to a deadlock, as most calls to
>>>>>>>> ext4_block_zero_page_range() occur under an active journal handle.
>>>>>>>> Writeback operations under an active journal handle may result in circular
>>>>>>>> waiting within journal transactions. So please return this error code, and
>>>>>>>> then ext4 can interrupt zero operations to prevent deadlock.
>>>>>>>>
>>>>>>>
>>>>>>> Hi Yi,
>>>>>>>
>>>>>>> Thanks for looking at this.
>>>>>>>
>>>>>>> Huh.. so the reason for falling back like this here is just that this
>>>>>>> was considered an optional optimization, with the flush in
>>>>>>> iomap_zero_range() being default fallback behavior. IIUC, what you're
>>>>>>> saying means that the current zero range behavior without this series is
>>>>>>> problematic for ext4-on-iomap..? 
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>> If so, have you observed issues you can share details about?
>>>>>>
>>>>>> Sure.
>>>>>>
>>>>>> Before delving into the specific details of this issue, I would like
>>>>>> to provide some background information on the rule that ext4 cannot
>>>>>> wait for writeback in an active journal handle. If you are aware of
>>>>>> this background, please skip this paragraph. While ext4 is writing back
>>>>>> the page cache, it may start a new journal handle to allocate blocks,
>>>>>> update the disksize, and convert unwritten extents after the I/O is
>>>>>> completed. When starting this new journal handle, if the current
>>>>>> running journal transaction is in the process of being submitted or
>>>>>> if the journal space is insufficient, it must wait for the ongoing
>>>>>> transaction to be completed, but the prerequisite for this is that all
>>>>>> currently running handles must be terminated. However, if we flush the
>>>>>> page cache under an active journal handle, we cannot stop it, which
>>>>>> may lead to a deadlock.
>>>>>>
>>>>>
>>>>> Ok, makes sense.
>>>>>
>>>>>> Now, the issue I have observed occurs when I attempt to use
>>>>>> iomap_zero_range() within ext4_block_zero_page_range(). My current
>>>>>> implementation is below (based on the latest fs-next).
>>>>>>
>>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>>> index 28547663e4fd..1a21667f3f7c 100644
>>>>>> --- a/fs/ext4/inode.c
>>>>>> +++ b/fs/ext4/inode.c
>>>>>> @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>>>>>>  	return 0;
>>>>>>  }
>>>>>>
>>>>>> +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset,
>>>>>> +			loff_t length, unsigned int flags, struct iomap *iomap,
>>>>>> +			struct iomap *srcmap)
>>>>>> +{
>>>>>> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
>>>>>> +	struct ext4_map_blocks map;
>>>>>> +	u8 blkbits = inode->i_blkbits;
>>>>>> +	int ret;
>>>>>> +
>>>>>> +	ret = ext4_emergency_state(inode->i_sb);
>>>>>> +	if (unlikely(ret))
>>>>>> +		return ret;
>>>>>> +
>>>>>> +	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
>>>>>> +		return -EINVAL;
>>>>>> +
>>>>>> +	/* Calculate the first and last logical blocks respectively. */
>>>>>> +	map.m_lblk = offset >> blkbits;
>>>>>> +	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>>>>>> +			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
>>>>>> +
>>>>>> +	ret = ext4_map_blocks(NULL, inode, &map, 0);
>>>>>> +	if (ret < 0)
>>>>>> +		return ret;
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
>>>>>> +	 * this bypasses the flush iomap uses to trigger extent conversion
>>>>>> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
>>>>>> +	 */
>>>>>> +	if ((map.m_flags & EXT4_MAP_UNWRITTEN) &&
>>>>>> +	    map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) {
>>>>>> +		loff_t end;
>>>>>> +
>>>>>> +		end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits,
>>>>>> +					      map.m_len << blkbits);
>>>>>> +		if ((end >> blkbits) < map.m_lblk + map.m_len)
>>>>>> +			map.m_len = (end >> blkbits) - map.m_lblk;
>>>>>> +	}
>>>>>> +
>>>>>> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>>>>>> +	return 0;
>>>>>> +}
>>>>>> +
>>>>>> +const struct iomap_ops ext4_iomap_buffered_zero_ops = {
>>>>>> +	.iomap_begin = ext4_iomap_buffered_zero_begin,
>>>>>> +};
>>>>>>
>>>>>>  const struct iomap_ops ext4_iomap_buffered_write_ops = {
>>>>>>  	.iomap_begin = ext4_iomap_buffered_write_begin,
>>>>>> @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>>>>>>  	return err;
>>>>>>  }
>>>>>>
>>>>>> +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from,
>>>>>> +					loff_t length)
>>>>>> +{
>>>>>> +	WARN_ON_ONCE(!inode_is_locked(inode) &&
>>>>>> +		     !rwsem_is_locked(&inode->i_mapping->invalidate_lock));
>>>>>> +
>>>>>> +	return iomap_zero_range(inode, from, length, NULL,
>>>>>> +				&ext4_iomap_buffered_zero_ops,
>>>>>> +				&ext4_iomap_write_ops, NULL);
>>>>>> +}
>>>>>> +
>>>>>>  /*
>>>>>>   * ext4_block_zero_page_range() zeros out a mapping of length 'length'
>>>>>>   * starting from file offset 'from'.  The range to be zero'd must
>>>>>> @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle,
>>>>>>  	if (IS_DAX(inode)) {
>>>>>>  		return dax_zero_range(inode, from, length, NULL,
>>>>>>  				      &ext4_iomap_ops);
>>>>>> +	} else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
>>>>>> +		return ext4_iomap_zero_range(inode, from, length);
>>>>>>  	}
>>>>>>  	return __ext4_block_zero_page_range(handle, mapping, from, length);
>>>>>>  }
>>>>>>
>>>>>> The problem is most calls to ext4_block_zero_page_range() occur under
>>>>>> an active journal handle, so I can reproduce the deadlock issue easily
>>>>>> without this series.
>>>>>>
>>>>>>>
>>>>>>> FWIW, I think your suggestion is reasonable, but I'm also curious what
>>>>>>> the error handling would look like in ext4. Do you expect to fail
>>>>>>> the higher level operation, for example? Cycle locks and retry, etc.?
>>>>>>
>>>>>> Originally, I wanted ext4_block_zero_page_range() to return a failure
>>>>>> to the higher level operation. However, unfortunately, after my testing
>>>>>> today, I discovered that even though we implement this, this series still
>>>>>> cannot resolve the issue. The corner case is:
>>>>>>
>>>>>> Assume we have a dirty folio that covers both hole and unwritten mappings.
>>>>>>
>>>>>>    |- dirty folio  -|
>>>>>>    [hhhhhhhhuuuuuuuu]                h:hole, u:unwritten
>>>>>>
>>>>>> If we punch the range of the hole, ext4_punch_hole()->
>>>>>> ext4_zero_partial_blocks() will zero out the first half of the dirty folio.
>>>>>> Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio
>>>>>> since the target range is a hole. Finally, iomap_zero_range() will still
>>>>>> flush this whole folio and lead to a deadlock while writing back the
>>>>>> latter half of the folio.
>>>>>>
>>>>>
>>>>> Hmm.. Ok. So it seems there are at least a couple ways around this
>>>>> particular quirk. I suspect one is that you could just call the fill
>>>>> helper in the hole case as well, but that's kind of a hack and not
>>>>> really intended use.
>>>>>
>>>>> The other way goes back to the fact that the flush for the hole case was
>>>>> kind of a corner case hack in the first place. The original comment for
>>>>> that seems to have been dropped, but see commit 7d9b474ee4cc ("iomap:
>>>>> make zero range flush conditional on unwritten mappings") for reference
>>>>> to the original intent.
>>>>>
>>>>> I'd have to go back and investigate if something regresses with that
>>>>> taken out, but my recollection is that was something that needed proper
>>>>> fixing eventually anyways. I'm particularly wondering if that is no
>>>>> longer an issue now that pagecache_isize_extended() handles the post-eof
>>>>> zeroing (the caveat being we might just need to call it in some
>>>>> additional size extension cases besides just setattr/truncate).
>>>>
>>>> Yeah, I agree with you. I suppose the post-EOF partial folio zeroing in
>>>> pagecache_isize_extended() should work.
>>>>
>>>
>>> Ok..
>>>
>>>>>
>>>>>>>
>>>>>>> The reason I ask is because the folio_batch handling has come up through
>>>>>>> discussions on this series. My position so far has been to keep it as a
>>>>>>> separate allocation and to keep things simple since it is currently
>>>>>>> isolated to zero range, but that may change if the usage spills over to
>>>>>>> other operations (which seems expected at this point). I suspect that if
>>>>>>> a filesystem actually depends on this for correct behavior, that is
>>>>>>> another data point worth considering on that topic.
>>>>>>>
>>>>>>> So that has me wondering if it would be better/easier here to perhaps
>>>>>>> embed the batch in iomap_iter, or maybe as an incremental step put it on
>>>>>>> the stack in iomap_zero_range() and initialize the iomap_iter pointer
>>>>>>> there instead of doing the dynamic allocation (then the fill helper
>>>>>>> would set a flag to indicate the fs did pagecache lookup). Thoughts on
>>>>>>> something like that?
>>>>>>>
>>>>>>> Also IIUC ext4-on-iomap is still a WIP and review on this series seems
>>>>>>> to have mostly wound down. Any objection if the fix for that comes along
>>>>>>> as a followup patch rather than a rework of this series?
>>>>>>
>>>>>> It seems that we don't need to modify this series; instead, we need
>>>>>> to consider other solutions to resolve this deadlock issue.
>>>>>>
>>>>>> In my v1 ext4-on-iomap series [1], I resolved this issue by moving all
>>>>>> instances of ext4_block_zero_page_range() out of the running journal
>>>>>> handle (please see patches 19-21). But I don't think this is a good solution
>>>>>> since it's complex and fragile. Besides, after commit c7fc0366c6562
>>>>>> ("ext4: partial zero eof block on unaligned inode size extension"), you
>>>>>> added more invocations of ext4_zero_partial_blocks(), and the situation
>>>>>> has become more complicated (Although I think the calls in the three
>>>>>> write_end callbacks can be removed).
>>>>>>
>>>>>> Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios
>>>>>> over unwritten mappings before zeroing partial blocks. This is because
>>>>>> ext4 always zeroes the in-memory page cache before zeroing (e.g., in
>>>>>> ext4_setattr() and ext4_punch_hole()), it means if the target range is
>>>>>> still dirty and unwritten when calling ext4_block_zero_page_range(), it
>>>>>> must have already been zeroed. Was I missing something? Therefore, I was
>>>>>> wondering if there are any ways to prevent flushing in
>>>>>> iomap_zero_range()? Any ideas?
>>>>>
>>>>> It's certainly possible that the quirk fixed by the flush in the hole
>>>>> case was never a problem on ext4, if that's what you mean. Most of the
>>>>> testing for this was on XFS since ext4 hadn't used iomap for buffered
>>>>> writes.
>>>>>
>>>>> At the end of the day, the batch mechanism is intended to facilitate
>>>>> avoiding the flush entirely. I'm still paging things back in here.. but
>>>>> if we had two smallish changes to this code path to 1. eliminate the
>>>>> dynamic folio_batch allocation and 2. drop the flush on hole mapping
>>>>> case, would that address the issues with iomap zero range for ext4?
>>>>>
>>>>
>>>> Thank you for looking at this!
>>>>
>>>> I made a simple modification to the iomap_zero_range() function based
>>>> on the second solution you mentioned, then tested it using kvm-xfstests
>>>> these days. This solution works fine on ext4 and I haven't found any
>>>> other risks so far. (Since my testing environment has sufficient memory, I
>>>> have not yet handled the case of memory allocation failure).
>>>>
>>>
>>> Great, thanks for evaluating that. I've been playing around with the
>>> exact same change the past few days. As it turns out, this still breaks
>>> something with XFS. I've narrowed it down to an interaction between a
>>> large eof folio that fails to split on truncate due to being dirty, COW
>>> prealloc and insert range racing with writeback in such a way that this
>>> results in a mapped post-eof block on the file. Technically that isn't
>>> the end of the world so long as it is zeroed, but a subsequent zero
>>> range can warn if the folio is reclaimed and we now end up with a new
>>> one that starts beyond EOF, because those folios don't write back.
>>>
>>> I have a mix of hacks that seems to address the generic/363 failure, but
>>> it still needs further testing and analysis to unwind my mess of various
>>> experiments and whatnot. ;P
>>
>> OK, thanks for debugging and analyzing this.
>>
>>>
>>>> --- a/fs/iomap/buffered-io.c
>>>> +++ b/fs/iomap/buffered-io.c
>>>> @@ -1520,7 +1520,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>>>> 		     srcmap->type == IOMAP_UNWRITTEN)) {
>>>> 			s64 status;
>>>>
>>>> -			if (range_dirty) {
>>>> +			if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) {
>>>> 				range_dirty = false;
>>>> 				status = iomap_zero_iter_flush_and_stale(&iter);
>>>> 			} else {
>>>>
>>>> Another thing I want to mention (although there are no real issues at
>>>> the moment, I still want to mention it) is that there appears to be
>>>> no consistency guarantee between the lookup of the mapping and the
>>>> folio_batch. For example, assume we have a file which contains two
>>>> dirty folios and two unwritten extents, where each folio corresponds
>>>> to one extent. We zero out these two folios.
>>>>
>>>>     | dirty folio 1  || dirty folio 2   |
>>>>     [uuuuuuuuuuuuuuuu][uuuuuuuuuuuuuuuuu]
>>>>
>>>> In the first call to ->iomap_begin(), we get the unwritten extent 1.
>>>> At the same time, another thread writes back folio 1 and clears this
>>>> folio, so this folio will not be added to the folio_batch. Then
>>>> iomap_zero_range() will still flush those two folios. When flushing
>>>> the second folio, there is still a risk of deadlock due to changes in
>>>> metadata.
>>>>
>>>
>>> Hmm.. not sure I follow the example. The folio batch should include the
>>> folio in any case other than where it can look it up, lock it and confirm
>>> it is clean. If the folio is clean and thus not included, the iomap
>>> logic should still see the empty folio batch and skip over the mapping
>>> if unwritten. (I want to replace this with a flag to address your memory
>>> allocation concern, but that is orthogonal to this logic.)
>>>
>>> Of course this should happen under appropriate fs locks such that iomap
>>> either sees the folio in dirty/writeback state where the mapping is
>>> unwritten, or if the folio has been cleaned, the mapping is reported as
>>> written.
>>>
>>> If I'm still missing something with your example above, can you
>>> elaborate a bit further? Thanks.
>>>
>>
>> Sorry for not making things clear. The race condition is the following:
>>
>> zero range                            sync_file_range
>> iomap_zero_range() //folio 1+2
>>  range_dirty = filemap_range_needs_writeback()
>>  //range_dirty is set to 'true'
>>  iomap_iter()
>>    ext4_iomap_buffer_zero_begin()
>>      ext4_map_blocks()
>>      //get unwritten extent 1
>>                                       sync_file_range() //folio 1
>>                                         iomap_writepages()
>>                                         ...
>>                                         iomap_finish_ioend()
>>                                           folio_end_writeback()
>>                                           //clear folio 1, and
>>                                           //extent 1 becomes written
>>      iomap_fill_dirty_folios()
>>      //do not add folio 1 to batch
> 
> I think the issue here is that this needs serialization between the
> lookups and extent conversion. I.e., XFS handles this by doing the folio
> lookup under the same locks used for looking up the extent. So in this
> scenario, that ensures that the only way we don't add the folio to the
> batch is if the mapping has already been converted by writeback
> completion..

Yeah, XFS serializes this using ip->i_lock. It performs mapping and
folio lookup under XFS_ILOCK_EXCL, and the writeback conversion should
also take this lock. Therefore, this race condition cannot happen in
XFS.

However, ext4 currently does not have such a lock that can provide this
serialization. Perhaps we could reuse EXT4_I(inode)->i_data_sem but we
need further analysis. At the moment, this case should not occur and
does not cause any real issues, so we don't need to address it now. We
can consider solutions when it becomes truly necessary in the future.
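
Just to sketch what reusing i_data_sem could look like (hypothetical and
untested, pending that analysis; ext4_map_blocks_locked() is a made-up
stand-in for an extent lookup that expects i_data_sem to already be
held, since ext4_map_blocks() takes it internally):

static int ext4_iomap_zero_begin_sketch(struct iomap_iter *iter,
					struct ext4_map_blocks *map)
{
	struct inode *inode = iter->inode;
	u8 blkbits = inode->i_blkbits;
	loff_t end;
	int ret;

	down_read(&EXT4_I(inode)->i_data_sem);

	/* hypothetical lookup helper that assumes i_data_sem is held */
	ret = ext4_map_blocks_locked(inode, map);
	if (ret < 0)
		goto out;

	/*
	 * If unwritten conversion at writeback completion is serialized by
	 * i_data_sem (it takes it for writing), then while we hold it for
	 * reading, a folio missing from the batch implies that range is
	 * already clean and converted.
	 */
	if (map->m_flags & EXT4_MAP_UNWRITTEN) {
		end = iomap_fill_dirty_folios(iter,
				(loff_t)map->m_lblk << blkbits,
				(loff_t)map->m_len << blkbits);
		if ((end >> blkbits) < map->m_lblk + map->m_len)
			map->m_len = (end >> blkbits) - map->m_lblk;
	}
out:
	up_read(&EXT4_I(inode)->i_data_sem);
	return ret < 0 ? ret : 0;
}

Lock ordering between i_data_sem and the folio locks taken by the fill
helper would also need checking, so this is only meant to show the shape
of the serialization, not a proposal.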

> 
>>   iomap_zero_iter_flush_and_stale()
>>   //!fbatch && IOMAP_UNWRITTEN && range_dirty
>>   //flush folio 1+2, folio 2 is still dirty, then deadlock
>>
> 
> I also think the failure characteristic here is different. In this case
> you'd see an empty fbatch because at least on the lookup side the
> mapping is presumed to be unwritten. So that means we still wouldn't
> flush, but the zeroing would probably race with writeback such that it
> skips zeroing the blocks when it shouldn't.
> 
>> Besides, if the range of folio 1 is initially clean and unwritten
>> (folio 2 is still dirty), the flush can also be triggered without the
>> concurrent sync_file_range.
>>
>> The problem is that we initially check `range_dirty` only once, and
>> that check covers the entire zero range instead of being redone on
>> each iteration, and there is no fs lock that can prevent a concurrent
>> writeback. Perhaps we need to check 'range_dirty' for each iteration?
>>
> 
> I don't think this needs to prevent writeback, but is there not any
> locking to protect extent lookups vs. extent conversion (via writeback)?
> 
> FWIW, I think it's a little hard to reason about some of these
> interactions because of the pending tweaks that aren't implemented yet.
> What I'd like to do is get another version of this posted (mostly as
> is), hopefully get it landed into -next or whatever, then get another
> series going with this handful of tweaks we're discussing to prepare for
> ext4 on iomap. From there we can work out any remaining details on
> things that might need to change on either side. Sound Ok?
> 

Yeah, it looks good to me. We should expedite merging of this series as
soon as possible to allow us to continue advancing the work on ext4 on
iomap.

Thanks,
Yi.



^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2025-08-07  4:58 UTC | newest]

Thread overview: 37+ messages
2025-07-14 20:41 [PATCH v3 0/7] iomap: zero range folio batch support Brian Foster
2025-07-14 20:41 ` [PATCH v3 1/7] filemap: add helper to look up dirty folios in a range Brian Foster
2025-07-15  5:20   ` Darrick J. Wong
2025-07-14 20:41 ` [PATCH v3 2/7] iomap: remove pos+len BUG_ON() to after folio lookup Brian Foster
2025-07-15  5:14   ` Darrick J. Wong
2025-07-14 20:41 ` [PATCH v3 3/7] iomap: optional zero range dirty folio processing Brian Foster
2025-07-15  5:22   ` Darrick J. Wong
2025-07-15 12:35     ` Brian Foster
2025-07-18 11:30     ` Zhang Yi
2025-07-18 13:48       ` Brian Foster
2025-07-19 11:07         ` Zhang Yi
2025-07-21  8:47           ` Zhang Yi
2025-07-28 12:57             ` Zhang Yi
2025-07-30 13:19               ` Brian Foster
2025-08-02  7:26                 ` Zhang Yi
2025-07-30 13:17           ` Brian Foster
2025-08-02  7:19             ` Zhang Yi
2025-08-05 13:08               ` Brian Foster
2025-08-06  3:10                 ` Zhang Yi
2025-08-06 13:25                   ` Brian Foster
2025-08-07  4:58                     ` Zhang Yi
2025-07-14 20:41 ` [PATCH v3 4/7] xfs: always trim mapping to requested range for zero range Brian Foster
2025-07-14 20:41 ` [PATCH v3 5/7] xfs: fill dirty folios on zero range of unwritten mappings Brian Foster
2025-07-15  5:28   ` Darrick J. Wong
2025-07-15 12:35     ` Brian Foster
2025-07-15 14:19       ` Darrick J. Wong
2025-07-14 20:41 ` [PATCH v3 6/7] iomap: remove old partial eof zeroing optimization Brian Foster
2025-07-15  5:34   ` Darrick J. Wong
2025-07-15 12:36     ` Brian Foster
2025-07-15 14:37       ` Darrick J. Wong
2025-07-15 16:20         ` Brian Foster
2025-07-15 16:30           ` Darrick J. Wong
2025-07-14 20:41 ` [PATCH v3 7/7] xfs: error tag to force zeroing on debug kernels Brian Foster
2025-07-15  5:24   ` Darrick J. Wong
2025-07-15 12:39     ` Brian Foster
2025-07-15 14:30       ` Darrick J. Wong
2025-07-15 16:20         ` Brian Foster
