linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/7] iomap: zero range folio batch support
@ 2025-06-05 17:33 Brian Foster
  2025-06-05 17:33 ` [PATCH 1/7] iomap: move pos+len BUG_ON() to after folio lookup Brian Foster
                   ` (6 more replies)
  0 siblings, 7 replies; 40+ messages in thread
From: Brian Foster @ 2025-06-05 17:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm

Hi all,

Here's a first real v1 of folio batch support for iomap. This initially
only targets zero range, the use case being zeroing of dirty folios over
unwritten mappings. There is potential to support other operations in
the future: iomap seek data/hole has raciness issues similar to zero
range, the prospect of using this for buffered writes has been raised
for granular locking purposes, etc.

The one major caveat with this zero range implementation is that it
doesn't look at iomap_folio_state to determine whether to zero a
sub-folio portion of the folio. Instead it just relies on whether the
folio was dirty or not. This means that spurious zeroing of unwritten
ranges is possible if a folio is dirty but the target range includes a
subrange that is not.

The reasoning is that this is essentially a complexity tradeoff. The
current use cases for iomap_zero_range() are limited mostly to partial
block zeroing scenarios. It's relatively harmless to zero an unwritten
block (i.e. not a correctness issue), and this is something that
filesystems have done in the past without much notice or issue. The
advantage is less code and this makes it a little easier to use a
filemap lookup function for the batch rather than open coding more logic
in iomap. That said, this can probably be enhanced to look at ifs in the
future if the use case expands and/or other operations justify it.
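
To make that tradeoff a bit more concrete, here is a rough sketch of the
two granularities (illustrative only, not code from this series; the
per-block variant is the hypothetical future enhancement and relies on
ifs_block_is_dirty() and struct iomap_folio_state, which are internal to
fs/iomap/buffered-io.c):

/*
 * Folio-granular check, as used by this series: a dirty folio gets the
 * whole target range zeroed, even if only part of the folio is dirty.
 */
static bool folio_needs_zeroing(struct folio *folio)
{
	return folio_test_dirty(folio) || folio_test_writeback(folio);
}

/*
 * Hypothetical per-block refinement: consult iomap_folio_state so clean
 * blocks within a dirty folio can be skipped.
 */
static bool blocks_need_zeroing(struct inode *inode, struct folio *folio,
				loff_t pos, u64 len)
{
	struct iomap_folio_state *ifs = folio->private;
	unsigned int first = offset_in_folio(folio, pos) >> inode->i_blkbits;
	unsigned int last = offset_in_folio(folio, pos + len - 1) >>
				inode->i_blkbits;
	unsigned int blk;

	if (!ifs)
		return folio_test_dirty(folio);
	for (blk = first; blk <= last; blk++)
		if (ifs_block_is_dirty(folio, ifs, blk))
			return true;
	return false;
}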

WRT testing, I've tested with and without a local hack to redirect
fallocate zero range calls to iomap_zero_range() in XFS. This helps test
beyond the partial block/folio use case, i.e. to cover boundary
conditions like full folio batch handling, etc. I recently added patch 7
in the spirit of that; it turns this logic into an XFS errortag. Further
comments on that are inline with patch 7.

Thoughts, reviews, flames appreciated.

Brian

v1:
- Dropped most prep patches from previous version (merged separately).
- Reworked dirty folio lookup to use find_get_entry() loop (new patch
  for filemap helper).
- Misc. bug fixes, code cleanups, comments, etc.
- Added (RFC) prospective patch for wider zero range test coverage.
RFCv2: https://lore.kernel.org/linux-fsdevel/20241213150528.1003662-1-bfoster@redhat.com/
- Port onto incremental advance, drop patch 1 from RFCv1.
- Moved batch into iomap_iter, dynamically allocate and drop flag.
- Tweak XFS patch to always trim zero range on EOF boundary.
RFCv1: https://lore.kernel.org/linux-fsdevel/20241119154656.774395-1-bfoster@redhat.com/

Brian Foster (7):
  iomap: move pos+len BUG_ON() to after folio lookup
  filemap: add helper to look up dirty folios in a range
  iomap: optional zero range dirty folio processing
  xfs: always trim mapping to requested range for zero range
  xfs: fill dirty folios on zero range of unwritten mappings
  iomap: remove old partial eof zeroing optimization
  xfs: error tag to force zeroing on debug kernels

 fs/iomap/buffered-io.c       | 103 ++++++++++++++++++++++++-----------
 fs/iomap/iter.c              |   6 ++
 fs/xfs/libxfs/xfs_errortag.h |   4 +-
 fs/xfs/xfs_error.c           |   3 +
 fs/xfs/xfs_file.c            |  21 +++++--
 fs/xfs/xfs_iomap.c           |  38 ++++++++++---
 include/linux/iomap.h        |   4 ++
 include/linux/pagemap.h      |   2 +
 mm/filemap.c                 |  42 ++++++++++++++
 9 files changed, 176 insertions(+), 47 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 1/7] iomap: move pos+len BUG_ON() to after folio lookup
  2025-06-05 17:33 [PATCH 0/7] iomap: zero range folio batch support Brian Foster
@ 2025-06-05 17:33 ` Brian Foster
  2025-06-09 16:16   ` Darrick J. Wong
  2025-06-05 17:33 ` [PATCH 2/7] filemap: add helper to look up dirty folios in a range Brian Foster
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 40+ messages in thread
From: Brian Foster @ 2025-06-05 17:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm

The bug checks at the top of iomap_write_begin() assume the pos/len
reflect exactly the next range to process. This may no longer be the
case once the get folio path is able to process a folio batch from
the filesystem. Move the checks a bit further down, after the folio
lookup and range trim, to verify everything lines up with the current
iomap.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/iomap/buffered-io.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 3729391a18f3..16499655e7b0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -805,15 +805,12 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
 {
 	const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
-	loff_t pos = iter->pos;
+	loff_t pos;
 	u64 len = min_t(u64, SIZE_MAX, iomap_length(iter));
 	struct folio *folio;
 	int status = 0;
 
 	len = min_not_zero(len, *plen);
-	BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
-	if (srcmap != &iter->iomap)
-		BUG_ON(pos + len > srcmap->offset + srcmap->length);
 
 	if (fatal_signal_pending(current))
 		return -EINTR;
@@ -843,6 +840,9 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
 	}
 
 	pos = iomap_trim_folio_range(iter, folio, poffset, &len);
+	BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
+	if (srcmap != &iter->iomap)
+		BUG_ON(pos + len > srcmap->offset + srcmap->length);
 
 	if (srcmap->type == IOMAP_INLINE)
 		status = iomap_write_begin_inline(iter, folio);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 2/7] filemap: add helper to look up dirty folios in a range
  2025-06-05 17:33 [PATCH 0/7] iomap: zero range folio batch support Brian Foster
  2025-06-05 17:33 ` [PATCH 1/7] iomap: move pos+len BUG_ON() to after folio lookup Brian Foster
@ 2025-06-05 17:33 ` Brian Foster
  2025-06-09 15:48   ` Darrick J. Wong
  2025-06-10  4:22   ` Christoph Hellwig
  2025-06-05 17:33 ` [PATCH 3/7] iomap: optional zero range dirty folio processing Brian Foster
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 40+ messages in thread
From: Brian Foster @ 2025-06-05 17:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm

Add a new filemap_get_folios_dirty() helper to look up existing dirty
folios in a range and add them to a folio_batch. This is to support
optimization of certain iomap operations that only care about dirty
folios in a target range. For example, zero range only zeroes the subset
of dirty pages over unwritten mappings, seek hole/data may use similar
logic in the future, etc.

Note that the helper is intended for use under internal fs locks, so it
only trylocks folios in order to filter out clean folios rather than
blocking; folios that cannot be locked are conservatively kept in the
batch. This loosely follows the logic from filemap_range_has_writeback().

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
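
For reviewers, a minimal usage sketch (illustrative only, not part of
the patch): the caller owns the batch, the helper advances *start past
the range it covered, and because the batch can fill before @end the
lookup is typically repeated until it returns 0. The function below is
made up for illustration; locking and error handling are elided.

static unsigned long count_dirty_pages(struct address_space *mapping,
				       loff_t pos, loff_t len)
{
	struct folio_batch fbatch;
	pgoff_t start = pos >> PAGE_SHIFT;
	pgoff_t end = (pos + len - 1) >> PAGE_SHIFT;
	unsigned long nr = 0;
	unsigned int i;

	folio_batch_init(&fbatch);
	while (filemap_get_folios_dirty(mapping, &start, end, &fbatch)) {
		for (i = 0; i < folio_batch_count(&fbatch); i++) {
			/* dirty, under writeback, or trylock failed */
			nr += folio_nr_pages(fbatch.folios[i]);
		}
		/* drop the references taken by the lookup */
		folio_batch_release(&fbatch);
	}
	return nr;
}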
 include/linux/pagemap.h |  2 ++
 mm/filemap.c            | 42 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e63fbfbd5b0f..fb83ddf26621 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -941,6 +941,8 @@ unsigned filemap_get_folios_contig(struct address_space *mapping,
 		pgoff_t *start, pgoff_t end, struct folio_batch *fbatch);
 unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t *start,
 		pgoff_t end, xa_mark_t tag, struct folio_batch *fbatch);
+unsigned filemap_get_folios_dirty(struct address_space *mapping,
+		pgoff_t *start, pgoff_t end, struct folio_batch *fbatch);
 
 /*
  * Returns locked page at given index in given cache, creating it if needed.
diff --git a/mm/filemap.c b/mm/filemap.c
index bada249b9fb7..d28e984cdfd4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2334,6 +2334,48 @@ unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t *start,
 }
 EXPORT_SYMBOL(filemap_get_folios_tag);
 
+unsigned filemap_get_folios_dirty(struct address_space *mapping, pgoff_t *start,
+			pgoff_t end, struct folio_batch *fbatch)
+{
+	XA_STATE(xas, &mapping->i_pages, *start);
+	struct folio *folio;
+
+	rcu_read_lock();
+	while ((folio = find_get_entry(&xas, end, XA_PRESENT)) != NULL) {
+		if (xa_is_value(folio))
+			continue;
+		if (folio_trylock(folio)) {
+			bool clean = !folio_test_dirty(folio) &&
+				     !folio_test_writeback(folio);
+			folio_unlock(folio);
+			if (clean) {
+				folio_put(folio);
+				continue;
+			}
+		}
+		if (!folio_batch_add(fbatch, folio)) {
+			unsigned long nr = folio_nr_pages(folio);
+			*start = folio->index + nr;
+			goto out;
+		}
+	}
+	/*
+	 * We come here when there is no page beyond @end. We take care to not
+	 * overflow the index @start as it confuses some of the callers. This
+	 * breaks the iteration when there is a page at index -1 but that is
+	 * already broke anyway.
+	 */
+	if (end == (pgoff_t)-1)
+		*start = (pgoff_t)-1;
+	else
+		*start = end + 1;
+out:
+	rcu_read_unlock();
+
+	return folio_batch_count(fbatch);
+}
+EXPORT_SYMBOL(filemap_get_folios_dirty);
+
 /*
  * CD/DVDs are error prone. When a medium error occurs, the driver may fail
  * a _large_ part of the i/o request. Imagine the worst scenario:
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 3/7] iomap: optional zero range dirty folio processing
  2025-06-05 17:33 [PATCH 0/7] iomap: zero range folio batch support Brian Foster
  2025-06-05 17:33 ` [PATCH 1/7] iomap: move pos+len BUG_ON() to after folio lookup Brian Foster
  2025-06-05 17:33 ` [PATCH 2/7] filemap: add helper to look up dirty folios in a range Brian Foster
@ 2025-06-05 17:33 ` Brian Foster
  2025-06-09 16:04   ` Darrick J. Wong
  2025-06-10  4:27   ` Christoph Hellwig
  2025-06-05 17:33 ` [PATCH 4/7] xfs: always trim mapping to requested range for zero range Brian Foster
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 40+ messages in thread
From: Brian Foster @ 2025-06-05 17:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm

The only way zero range can currently process unwritten mappings
with dirty pagecache is to check whether the range is dirty before
mapping lookup and then flush when at least one underlying mapping
is unwritten. This ordering is required to prevent iomap lookup from
racing with folio writeback and reclaim.

Since zero range can skip ranges of unwritten mappings that are
clean in cache, this operation can be improved by allowing the
filesystem to provide a set of dirty folios that require zeroing. In
turn, rather than flush or iterate file offsets, zero range can
iterate on folios in the batch and advance over clean or uncached
ranges in between.

Add a folio_batch pointer to struct iomap_iter and provide a helper for
filesystems to populate the batch at lookup time. Update the folio
lookup path to return the next folio in the batch, if one is provided,
and advance the iter if the folio starts beyond the current offset.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
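
For context, a simplified sketch of the expected fs-side call (the real
XFS conversion follows in patch 5; everything here other than
iomap_fill_dirty_folios() itself is illustrative): an ->iomap_begin
handler that sees IOMAP_ZERO over an unwritten mapping fills the batch
and trims the mapping it returns to the offset where the lookup ended.

	if ((flags & IOMAP_ZERO) && mapping_is_unwritten) {
		/* the iomap passed in is embedded in the iomap_iter */
		struct iomap_iter *iter =
			container_of(iomap, struct iomap_iter, iomap);
		loff_t end;

		/*
		 * Returns the file offset the lookup stopped at, or
		 * offset + length if the batch could not be allocated,
		 * in which case iomap falls back to the flush path.
		 */
		end = iomap_fill_dirty_folios(iter, offset, length);
		length = min_t(loff_t, length, end - offset);
	}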
 fs/iomap/buffered-io.c | 73 +++++++++++++++++++++++++++++++++++++++---
 fs/iomap/iter.c        |  6 ++++
 include/linux/iomap.h  |  4 +++
 3 files changed, 78 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 16499655e7b0..cf2f4f869920 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -750,6 +750,16 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter, size_t len)
 	if (!mapping_large_folio_support(iter->inode->i_mapping))
 		len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
 
+	if (iter->fbatch) {
+		struct folio *folio = folio_batch_next(iter->fbatch);
+
+		if (folio) {
+			folio_get(folio);
+			folio_lock(folio);
+		}
+		return folio;
+	}
+
 	if (folio_ops && folio_ops->get_folio)
 		return folio_ops->get_folio(iter, pos, len);
 	else
@@ -811,6 +821,8 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
 	int status = 0;
 
 	len = min_not_zero(len, *plen);
+	*foliop = NULL;
+	*plen = 0;
 
 	if (fatal_signal_pending(current))
 		return -EINTR;
@@ -819,6 +831,12 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
 	if (IS_ERR(folio))
 		return PTR_ERR(folio);
 
+	/* no folio means we're done with a batch */
+	if (!folio) {
+		WARN_ON_ONCE(!iter->fbatch);
+		return 0;
+	}
+
 	/*
 	 * Now we have a locked folio, before we do anything with it we need to
 	 * check that the iomap we have cached is not stale. The inode extent
@@ -839,6 +857,20 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
 		}
 	}
 
+	/*
+	 * folios in a batch may not be contiguous. If we've skipped forward,
+	 * advance the iter to the pos of the current folio. If the folio starts
+	 * beyond the end of the mapping, it may have been trimmed since the
+	 * lookup for whatever reason. Return a NULL folio to terminate the op.
+	 */
+	if (folio_pos(folio) > iter->pos) {
+		len = min_t(u64, folio_pos(folio) - iter->pos,
+				 iomap_length(iter));
+		status = iomap_iter_advance(iter, &len);
+		if (status || !len)
+			goto out_unlock;
+	}
+
 	pos = iomap_trim_folio_range(iter, folio, poffset, &len);
 	BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
 	if (srcmap != &iter->iomap)
@@ -1380,6 +1412,12 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (iter->iomap.flags & IOMAP_F_STALE)
 			break;
 
+		/* a NULL folio means we're done with a folio batch */
+		if (!folio) {
+			status = iomap_iter_advance_full(iter);
+			break;
+		}
+
 		/* warn about zeroing folios beyond eof that won't write back */
 		WARN_ON_ONCE(folio_pos(folio) > iter->inode->i_size);
 
@@ -1401,6 +1439,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 	return status;
 }
 
+loff_t
+iomap_fill_dirty_folios(
+	struct iomap_iter	*iter,
+	loff_t			offset,
+	loff_t			length)
+{
+	struct address_space	*mapping = iter->inode->i_mapping;
+	pgoff_t			start = offset >> PAGE_SHIFT;
+	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;
+
+	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
+	if (!iter->fbatch)
+		return offset + length;
+	folio_batch_init(iter->fbatch);
+
+	filemap_get_folios_dirty(mapping, &start, end, iter->fbatch);
+	return (start << PAGE_SHIFT);
+}
+EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios);
+
 int
 iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 		const struct iomap_ops *ops, void *private)
@@ -1429,7 +1487,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 	 * flushing on partial eof zeroing, special case it to zero the
 	 * unaligned start portion if already dirty in pagecache.
 	 */
-	if (off &&
+	if (!iter.fbatch && off &&
 	    filemap_range_needs_writeback(mapping, pos, pos + plen - 1)) {
 		iter.len = plen;
 		while ((ret = iomap_iter(&iter, ops)) > 0)
@@ -1445,13 +1503,18 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 	 * if dirty and the fs returns a mapping that might convert on
 	 * writeback.
 	 */
-	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
-					iter.pos, iter.pos + iter.len - 1);
+	range_dirty = filemap_range_needs_writeback(mapping, iter.pos,
+					iter.pos + iter.len - 1);
 	while ((ret = iomap_iter(&iter, ops)) > 0) {
 		const struct iomap *srcmap = iomap_iter_srcmap(&iter);
 
-		if (srcmap->type == IOMAP_HOLE ||
-		    srcmap->type == IOMAP_UNWRITTEN) {
+		if (WARN_ON_ONCE(iter.fbatch &&
+				 srcmap->type != IOMAP_UNWRITTEN))
+			return -EIO;
+
+		if (!iter.fbatch &&
+		    (srcmap->type == IOMAP_HOLE ||
+		     srcmap->type == IOMAP_UNWRITTEN)) {
 			s64 status;
 
 			if (range_dirty) {
diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
index 6ffc6a7b9ba5..89bd5951a6fd 100644
--- a/fs/iomap/iter.c
+++ b/fs/iomap/iter.c
@@ -9,6 +9,12 @@
 
 static inline void iomap_iter_reset_iomap(struct iomap_iter *iter)
 {
+	if (iter->fbatch) {
+		folio_batch_release(iter->fbatch);
+		kfree(iter->fbatch);
+		iter->fbatch = NULL;
+	}
+
 	iter->status = 0;
 	memset(&iter->iomap, 0, sizeof(iter->iomap));
 	memset(&iter->srcmap, 0, sizeof(iter->srcmap));
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 522644d62f30..0b9b460b2873 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -9,6 +9,7 @@
 #include <linux/types.h>
 #include <linux/mm_types.h>
 #include <linux/blkdev.h>
+#include <linux/pagevec.h>
 
 struct address_space;
 struct fiemap_extent_info;
@@ -239,6 +240,7 @@ struct iomap_iter {
 	unsigned flags;
 	struct iomap iomap;
 	struct iomap srcmap;
+	struct folio_batch *fbatch;
 	void *private;
 };
 
@@ -345,6 +347,8 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
 bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
+loff_t iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t offset,
+		loff_t length);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
 		bool *did_zero, const struct iomap_ops *ops, void *private);
 int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 4/7] xfs: always trim mapping to requested range for zero range
  2025-06-05 17:33 [PATCH 0/7] iomap: zero range folio batch support Brian Foster
                   ` (2 preceding siblings ...)
  2025-06-05 17:33 ` [PATCH 3/7] iomap: optional zero range dirty folio processing Brian Foster
@ 2025-06-05 17:33 ` Brian Foster
  2025-06-09 16:07   ` Darrick J. Wong
  2025-06-05 17:33 ` [PATCH 5/7] xfs: fill dirty folios on zero range of unwritten mappings Brian Foster
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 40+ messages in thread
From: Brian Foster @ 2025-06-05 17:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm

Refactor and tweak the IOMAP_ZERO logic in preparation for filling the
folio batch for unwritten mappings. Drop the superfluous imap offset
check since the hole case has already been filtered out. Split the
delalloc case handling into a sub-branch, and always trim the imap to
the requested offset/count so it can be more easily used to bound the
pagecache lookup range.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_iomap.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index ff05e6b1b0bb..b5cf5bc6308d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1756,21 +1756,20 @@ xfs_buffered_write_iomap_begin(
 	}
 
 	/*
-	 * For zeroing, trim a delalloc extent that extends beyond the EOF
-	 * block.  If it starts beyond the EOF block, convert it to an
+	 * For zeroing, trim extents that extend beyond the EOF block. If a
+	 * delalloc extent starts beyond the EOF block, convert it to an
 	 * unwritten extent.
 	 */
-	if ((flags & IOMAP_ZERO) && imap.br_startoff <= offset_fsb &&
-	    isnullstartblock(imap.br_startblock)) {
+	if (flags & IOMAP_ZERO) {
 		xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
 
-		if (offset_fsb >= eof_fsb)
+		if (isnullstartblock(imap.br_startblock) &&
+		    offset_fsb >= eof_fsb)
 			goto convert_delay;
-		if (end_fsb > eof_fsb) {
+		if (offset_fsb < eof_fsb && end_fsb > eof_fsb)
 			end_fsb = eof_fsb;
-			xfs_trim_extent(&imap, offset_fsb,
-					end_fsb - offset_fsb);
-		}
+
+		xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
 	}
 
 	/*
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 5/7] xfs: fill dirty folios on zero range of unwritten mappings
  2025-06-05 17:33 [PATCH 0/7] iomap: zero range folio batch support Brian Foster
                   ` (3 preceding siblings ...)
  2025-06-05 17:33 ` [PATCH 4/7] xfs: always trim mapping to requested range for zero range Brian Foster
@ 2025-06-05 17:33 ` Brian Foster
  2025-06-06  2:02   ` kernel test robot
  2025-06-09 16:12   ` Darrick J. Wong
  2025-06-05 17:33 ` [PATCH 6/7] iomap: remove old partial eof zeroing optimization Brian Foster
  2025-06-05 17:33 ` [PATCH RFC 7/7] xfs: error tag to force zeroing on debug kernels Brian Foster
  6 siblings, 2 replies; 40+ messages in thread
From: Brian Foster @ 2025-06-05 17:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm

Use the iomap folio batch mechanism to select folios to zero on zero
range of unwritten mappings. Trim the resulting mapping to the end of
the folio lookup if the batch is filled (unlikely for current use
cases); this ensures the remainder of the range is revisited with a
fresh lookup and batch on the next iteration rather than being skipped
as if it were clean.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_iomap.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index b5cf5bc6308d..63054f7ead0e 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1691,6 +1691,8 @@ xfs_buffered_write_iomap_begin(
 	struct iomap		*iomap,
 	struct iomap		*srcmap)
 {
+	struct iomap_iter	*iter = container_of(iomap, struct iomap_iter,
+						     iomap);
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
@@ -1762,6 +1764,7 @@ xfs_buffered_write_iomap_begin(
 	 */
 	if (flags & IOMAP_ZERO) {
 		xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
+		u64 end;
 
 		if (isnullstartblock(imap.br_startblock) &&
 		    offset_fsb >= eof_fsb)
@@ -1769,6 +1772,26 @@ xfs_buffered_write_iomap_begin(
 		if (offset_fsb < eof_fsb && end_fsb > eof_fsb)
 			end_fsb = eof_fsb;
 
+		/*
+		 * Look up dirty folios for unwritten mappings within EOF.
+		 * Providing this bypasses the flush iomap uses to trigger
+		 * extent conversion when unwritten mappings have dirty
+		 * pagecache in need of zeroing.
+		 *
+		 * Trim the mapping to the end pos of the lookup, which in turn
+		 * was trimmed to the end of the batch if it became full before
+		 * the end of the mapping.
+		 */
+		if (imap.br_state == XFS_EXT_UNWRITTEN &&
+		    offset_fsb < eof_fsb) {
+			loff_t len = min(count,
+					 XFS_FSB_TO_B(mp, imap.br_blockcount));
+
+			end = iomap_fill_dirty_folios(iter, offset, len);
+			end_fsb = min_t(xfs_fileoff_t, end_fsb,
+					XFS_B_TO_FSB(mp, end));
+		}
+
 		xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
 	}
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 6/7] iomap: remove old partial eof zeroing optimization
  2025-06-05 17:33 [PATCH 0/7] iomap: zero range folio batch support Brian Foster
                   ` (4 preceding siblings ...)
  2025-06-05 17:33 ` [PATCH 5/7] xfs: fill dirty folios on zero range of unwritten mappings Brian Foster
@ 2025-06-05 17:33 ` Brian Foster
  2025-06-10  4:32   ` Christoph Hellwig
  2025-06-05 17:33 ` [PATCH RFC 7/7] xfs: error tag to force zeroing on debug kernels Brian Foster
  6 siblings, 1 reply; 40+ messages in thread
From: Brian Foster @ 2025-06-05 17:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm

iomap_zero_range() optimizes the partial eof block zeroing use case
by force zeroing if the mapping is dirty. This is to avoid frequent
flushing on file extending workloads, which hurts performance.

Now that the folio batch mechanism provides a more generic solution
and is used by the only real zero range user (XFS), this isolated
optimization is no longer needed. Remove the unnecessary code and
let callers use the folio batch or fall back to flushing by default.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/iomap/buffered-io.c | 24 ------------------------
 1 file changed, 24 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index cf2f4f869920..ec58f5dae71c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1471,33 +1471,9 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 		.private	= private,
 	};
 	struct address_space *mapping = inode->i_mapping;
-	unsigned int blocksize = i_blocksize(inode);
-	unsigned int off = pos & (blocksize - 1);
-	loff_t plen = min_t(loff_t, len, blocksize - off);
 	int ret;
 	bool range_dirty;
 
-	/*
-	 * Zero range can skip mappings that are zero on disk so long as
-	 * pagecache is clean. If pagecache was dirty prior to zero range, the
-	 * mapping converts on writeback completion and so must be zeroed.
-	 *
-	 * The simplest way to deal with this across a range is to flush
-	 * pagecache and process the updated mappings. To avoid excessive
-	 * flushing on partial eof zeroing, special case it to zero the
-	 * unaligned start portion if already dirty in pagecache.
-	 */
-	if (!iter.fbatch && off &&
-	    filemap_range_needs_writeback(mapping, pos, pos + plen - 1)) {
-		iter.len = plen;
-		while ((ret = iomap_iter(&iter, ops)) > 0)
-			iter.status = iomap_zero_iter(&iter, did_zero);
-
-		iter.len = len - (iter.pos - pos);
-		if (ret || !iter.len)
-			return ret;
-	}
-
 	/*
 	 * To avoid an unconditional flush, check pagecache state and only flush
 	 * if dirty and the fs returns a mapping that might convert on
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC 7/7] xfs: error tag to force zeroing on debug kernels
  2025-06-05 17:33 [PATCH 0/7] iomap: zero range folio batch support Brian Foster
                   ` (5 preceding siblings ...)
  2025-06-05 17:33 ` [PATCH 6/7] iomap: remove old partial eof zeroing optimization Brian Foster
@ 2025-06-05 17:33 ` Brian Foster
  2025-06-10  4:33   ` Christoph Hellwig
  6 siblings, 1 reply; 40+ messages in thread
From: Brian Foster @ 2025-06-05 17:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm

iomap_zero_range() has to cover various corner cases that are
difficult to test on production kernels because it is used in fairly
limited use cases. For example, it is currently only used by XFS and
mostly only in partial block zeroing cases.

While it's possible to test most of these functional cases, we can
provide more robust test coverage by co-opting fallocate zero range
to invoke zeroing of the entire range instead of the more efficient
block punch/allocate sequence. Add an errortag to occasionally
invoke forced zeroing.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Hi all,

My original test approach involved a hack that redirected fallocate zero
range to iomap. The easiest way to incorporate something like that in
XFS would probably be via a randomized debug branch in the same path
(i.e. similar to what is done in the block/inode allocation paths), but
I find it a little annoying that there's currently no great way to run a
full fstests run with certain error tags enabled.

So for that I was playing around with some fstests hacks to hook into
the device mount paths and enable certain errortags immediately on each
mount. Obviously this implicitly depends on XFS_DEBUG plus the specific
errortag support, but otherwise I'd probably propose something simple
like enabling or disabling the whole thing with an XFS_ERRORS_ON=1 type
setting or some such.

So that's the context for this patch. I'm curious if there are any
thoughts..? Another simple option might be to enable certain errortags
by default on XFS_DEBUG, or with XFS_DEBUG plus some new kernel config
option (i.e. XFS_DEBUG_ERRORTAGS). That would be less code in fstests
and perhaps facilitate external testing, but we'd still need to track
which tags get enabled by default in that mode. OTOH, if you consider
the couple or so open coded get_random_u32_below(2) branches we do have,
these are in theory just errortags that happen to be on by default on
XFS_DEBUG kernels (without the log noise). Thoughts?

Brian

 fs/xfs/libxfs/xfs_errortag.h |  4 +++-
 fs/xfs/xfs_error.c           |  3 +++
 fs/xfs/xfs_file.c            | 21 +++++++++++++++------
 3 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index a53c5d40e084..33ca3fc2ca88 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -65,7 +65,8 @@
 #define XFS_ERRTAG_WRITE_DELAY_MS			43
 #define XFS_ERRTAG_EXCHMAPS_FINISH_ONE			44
 #define XFS_ERRTAG_METAFILE_RESV_CRITICAL		45
-#define XFS_ERRTAG_MAX					46
+#define XFS_ERRTAG_FORCE_ZERO_RANGE			46
+#define XFS_ERRTAG_MAX					47
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -115,5 +116,6 @@
 #define XFS_RANDOM_WRITE_DELAY_MS			3000
 #define XFS_RANDOM_EXCHMAPS_FINISH_ONE			1
 #define XFS_RANDOM_METAFILE_RESV_CRITICAL		4
+#define XFS_RANDOM_FORCE_ZERO_RANGE			4
 
 #endif /* __XFS_ERRORTAG_H_ */
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index dbd87e137694..00c0c391c329 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -64,6 +64,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_WRITE_DELAY_MS,
 	XFS_RANDOM_EXCHMAPS_FINISH_ONE,
 	XFS_RANDOM_METAFILE_RESV_CRITICAL,
+	XFS_RANDOM_FORCE_ZERO_RANGE,
 };
 
 struct xfs_errortag_attr {
@@ -183,6 +184,7 @@ XFS_ERRORTAG_ATTR_RW(wb_delay_ms,	XFS_ERRTAG_WB_DELAY_MS);
 XFS_ERRORTAG_ATTR_RW(write_delay_ms,	XFS_ERRTAG_WRITE_DELAY_MS);
 XFS_ERRORTAG_ATTR_RW(exchmaps_finish_one, XFS_ERRTAG_EXCHMAPS_FINISH_ONE);
 XFS_ERRORTAG_ATTR_RW(metafile_resv_crit, XFS_ERRTAG_METAFILE_RESV_CRITICAL);
+XFS_ERRORTAG_ATTR_RW(force_zero_range, XFS_ERRTAG_FORCE_ZERO_RANGE);
 
 static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -230,6 +232,7 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(write_delay_ms),
 	XFS_ERRORTAG_ATTR_LIST(exchmaps_finish_one),
 	XFS_ERRORTAG_ATTR_LIST(metafile_resv_crit),
+	XFS_ERRORTAG_ATTR_LIST(force_zero_range),
 	NULL,
 };
 ATTRIBUTE_GROUPS(xfs_errortag);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 48254a72071b..88ce754ef226 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -27,6 +27,8 @@
 #include "xfs_file.h"
 #include "xfs_aops.h"
 #include "xfs_zone_alloc.h"
+#include "xfs_error.h"
+#include "xfs_errortag.h"
 
 #include <linux/dax.h>
 #include <linux/falloc.h>
@@ -1269,13 +1271,20 @@ xfs_falloc_zero_range(
 	if (error)
 		return error;
 
-	error = xfs_free_file_space(XFS_I(inode), offset, len, ac);
-	if (error)
-		return error;
+	/* randomly force zeroing to exercise zero range */
+	if (XFS_TEST_ERROR(false, XFS_I(inode)->i_mount,
+			   XFS_ERRTAG_FORCE_ZERO_RANGE)) {
+		error = xfs_zero_range(XFS_I(inode), offset, len, ac, NULL);
+	} else {
+		error = xfs_free_file_space(XFS_I(inode), offset, len, ac);
+		if (error)
+			return error;
 
-	len = round_up(offset + len, blksize) - round_down(offset, blksize);
-	offset = round_down(offset, blksize);
-	error = xfs_alloc_file_space(XFS_I(inode), offset, len);
+		len = round_up(offset + len, blksize) -
+			round_down(offset, blksize);
+		offset = round_down(offset, blksize);
+		error = xfs_alloc_file_space(XFS_I(inode), offset, len);
+	}
 	if (error)
 		return error;
 	return xfs_falloc_setsize(file, new_size);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 5/7] xfs: fill dirty folios on zero range of unwritten mappings
  2025-06-05 17:33 ` [PATCH 5/7] xfs: fill dirty folios on zero range of unwritten mappings Brian Foster
@ 2025-06-06  2:02   ` kernel test robot
  2025-06-06 15:20     ` Brian Foster
  2025-06-09 16:12   ` Darrick J. Wong
  1 sibling, 1 reply; 40+ messages in thread
From: kernel test robot @ 2025-06-06  2:02 UTC (permalink / raw)
  To: Brian Foster, linux-fsdevel; +Cc: oe-kbuild-all, linux-xfs, linux-mm

Hi Brian,

kernel test robot noticed the following build errors:

[auto build test ERROR on brauner-vfs/vfs.all]
[also build test ERROR on akpm-mm/mm-everything linus/master next-20250605]
[cannot apply to xfs-linux/for-next v6.15]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Brian-Foster/iomap-move-pos-len-BUG_ON-to-after-folio-lookup/20250606-013227
base:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.all
patch link:    https://lore.kernel.org/r/20250605173357.579720-6-bfoster%40redhat.com
patch subject: [PATCH 5/7] xfs: fill dirty folios on zero range of unwritten mappings
config: i386-buildonly-randconfig-003-20250606 (https://download.01.org/0day-ci/archive/20250606/202506060903.vM8I4O0S-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250606/202506060903.vM8I4O0S-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506060903.vM8I4O0S-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   fs/xfs/xfs_iomap.c: In function 'xfs_buffered_write_iomap_begin':
>> fs/xfs/xfs_iomap.c:1602:55: error: 'iter' undeclared (first use in this function)
    1602 |                         end = iomap_fill_dirty_folios(iter, offset, len);
         |                                                       ^~~~
   fs/xfs/xfs_iomap.c:1602:55: note: each undeclared identifier is reported only once for each function it appears in
   fs/xfs/xfs_iomap.c: In function 'xfs_seek_iomap_begin':
>> fs/xfs/xfs_iomap.c:1893:34: warning: unused variable 'iter' [-Wunused-variable]
    1893 |         struct iomap_iter       *iter = container_of(iomap, struct iomap_iter,
         |                                  ^~~~


vim +/iter +1602 fs/xfs/xfs_iomap.c

  1498	
  1499	static int
  1500	xfs_buffered_write_iomap_begin(
  1501		struct inode		*inode,
  1502		loff_t			offset,
  1503		loff_t			count,
  1504		unsigned		flags,
  1505		struct iomap		*iomap,
  1506		struct iomap		*srcmap)
  1507	{
  1508		struct xfs_inode	*ip = XFS_I(inode);
  1509		struct xfs_mount	*mp = ip->i_mount;
  1510		xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
  1511		xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, count);
  1512		struct xfs_bmbt_irec	imap, cmap;
  1513		struct xfs_iext_cursor	icur, ccur;
  1514		xfs_fsblock_t		prealloc_blocks = 0;
  1515		bool			eof = false, cow_eof = false, shared = false;
  1516		int			allocfork = XFS_DATA_FORK;
  1517		int			error = 0;
  1518		unsigned int		lockmode = XFS_ILOCK_EXCL;
  1519		unsigned int		iomap_flags = 0;
  1520		u64			seq;
  1521	
  1522		if (xfs_is_shutdown(mp))
  1523			return -EIO;
  1524	
  1525		if (xfs_is_zoned_inode(ip))
  1526			return xfs_zoned_buffered_write_iomap_begin(inode, offset,
  1527					count, flags, iomap, srcmap);
  1528	
  1529		/* we can't use delayed allocations when using extent size hints */
  1530		if (xfs_get_extsz_hint(ip))
  1531			return xfs_direct_write_iomap_begin(inode, offset, count,
  1532					flags, iomap, srcmap);
  1533	
  1534		error = xfs_qm_dqattach(ip);
  1535		if (error)
  1536			return error;
  1537	
  1538		error = xfs_ilock_for_iomap(ip, flags, &lockmode);
  1539		if (error)
  1540			return error;
  1541	
  1542		if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(&ip->i_df)) ||
  1543		    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
  1544			xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
  1545			error = -EFSCORRUPTED;
  1546			goto out_unlock;
  1547		}
  1548	
  1549		XFS_STATS_INC(mp, xs_blk_mapw);
  1550	
  1551		error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
  1552		if (error)
  1553			goto out_unlock;
  1554	
  1555		/*
  1556		 * Search the data fork first to look up our source mapping.  We
  1557		 * always need the data fork map, as we have to return it to the
  1558		 * iomap code so that the higher level write code can read data in to
  1559		 * perform read-modify-write cycles for unaligned writes.
  1560		 */
  1561		eof = !xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &icur, &imap);
  1562		if (eof)
  1563			imap.br_startoff = end_fsb; /* fake hole until the end */
  1564	
  1565		/* We never need to allocate blocks for zeroing or unsharing a hole. */
  1566		if ((flags & (IOMAP_UNSHARE | IOMAP_ZERO)) &&
  1567		    imap.br_startoff > offset_fsb) {
  1568			xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
  1569			goto out_unlock;
  1570		}
  1571	
  1572		/*
  1573		 * For zeroing, trim extents that extend beyond the EOF block. If a
  1574		 * delalloc extent starts beyond the EOF block, convert it to an
  1575		 * unwritten extent.
  1576		 */
  1577		if (flags & IOMAP_ZERO) {
  1578			xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
  1579			u64 end;
  1580	
  1581			if (isnullstartblock(imap.br_startblock) &&
  1582			    offset_fsb >= eof_fsb)
  1583				goto convert_delay;
  1584			if (offset_fsb < eof_fsb && end_fsb > eof_fsb)
  1585				end_fsb = eof_fsb;
  1586	
  1587			/*
  1588			 * Look up dirty folios for unwritten mappings within EOF.
  1589			 * Providing this bypasses the flush iomap uses to trigger
  1590			 * extent conversion when unwritten mappings have dirty
  1591			 * pagecache in need of zeroing.
  1592			 *
  1593			 * Trim the mapping to the end pos of the lookup, which in turn
  1594			 * was trimmed to the end of the batch if it became full before
  1595			 * the end of the mapping.
  1596			 */
  1597			if (imap.br_state == XFS_EXT_UNWRITTEN &&
  1598			    offset_fsb < eof_fsb) {
  1599				loff_t len = min(count,
  1600						 XFS_FSB_TO_B(mp, imap.br_blockcount));
  1601	
> 1602				end = iomap_fill_dirty_folios(iter, offset, len);
  1603				end_fsb = min_t(xfs_fileoff_t, end_fsb,
  1604						XFS_B_TO_FSB(mp, end));
  1605			}
  1606	
  1607			xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
  1608		}
  1609	
  1610		/*
  1611		 * Search the COW fork extent list even if we did not find a data fork
  1612		 * extent.  This serves two purposes: first this implements the
  1613		 * speculative preallocation using cowextsize, so that we also unshare
  1614		 * block adjacent to shared blocks instead of just the shared blocks
  1615		 * themselves.  Second the lookup in the extent list is generally faster
  1616		 * than going out to the shared extent tree.
  1617		 */
  1618		if (xfs_is_cow_inode(ip)) {
  1619			if (!ip->i_cowfp) {
  1620				ASSERT(!xfs_is_reflink_inode(ip));
  1621				xfs_ifork_init_cow(ip);
  1622			}
  1623			cow_eof = !xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb,
  1624					&ccur, &cmap);
  1625			if (!cow_eof && cmap.br_startoff <= offset_fsb) {
  1626				trace_xfs_reflink_cow_found(ip, &cmap);
  1627				goto found_cow;
  1628			}
  1629		}
  1630	
  1631		if (imap.br_startoff <= offset_fsb) {
  1632			/*
  1633			 * For reflink files we may need a delalloc reservation when
  1634			 * overwriting shared extents.   This includes zeroing of
  1635			 * existing extents that contain data.
  1636			 */
  1637			if (!xfs_is_cow_inode(ip) ||
  1638			    ((flags & IOMAP_ZERO) && imap.br_state != XFS_EXT_NORM)) {
  1639				trace_xfs_iomap_found(ip, offset, count, XFS_DATA_FORK,
  1640						&imap);
  1641				goto found_imap;
  1642			}
  1643	
  1644			xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
  1645	
  1646			/* Trim the mapping to the nearest shared extent boundary. */
  1647			error = xfs_bmap_trim_cow(ip, &imap, &shared);
  1648			if (error)
  1649				goto out_unlock;
  1650	
  1651			/* Not shared?  Just report the (potentially capped) extent. */
  1652			if (!shared) {
  1653				trace_xfs_iomap_found(ip, offset, count, XFS_DATA_FORK,
  1654						&imap);
  1655				goto found_imap;
  1656			}
  1657	
  1658			/*
  1659			 * Fork all the shared blocks from our write offset until the
  1660			 * end of the extent.
  1661			 */
  1662			allocfork = XFS_COW_FORK;
  1663			end_fsb = imap.br_startoff + imap.br_blockcount;
  1664		} else {
  1665			/*
  1666			 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
  1667			 * pages to keep the chunks of work done where somewhat
  1668			 * symmetric with the work writeback does.  This is a completely
  1669			 * arbitrary number pulled out of thin air.
  1670			 *
  1671			 * Note that the values needs to be less than 32-bits wide until
  1672			 * the lower level functions are updated.
  1673			 */
  1674			count = min_t(loff_t, count, 1024 * PAGE_SIZE);
  1675			end_fsb = xfs_iomap_end_fsb(mp, offset, count);
  1676	
  1677			if (xfs_is_always_cow_inode(ip))
  1678				allocfork = XFS_COW_FORK;
  1679		}
  1680	
  1681		if (eof && offset + count > XFS_ISIZE(ip)) {
  1682			/*
  1683			 * Determine the initial size of the preallocation.
  1684			 * We clean up any extra preallocation when the file is closed.
  1685			 */
  1686			if (xfs_has_allocsize(mp))
  1687				prealloc_blocks = mp->m_allocsize_blocks;
  1688			else if (allocfork == XFS_DATA_FORK)
  1689				prealloc_blocks = xfs_iomap_prealloc_size(ip, allocfork,
  1690							offset, count, &icur);
  1691			else
  1692				prealloc_blocks = xfs_iomap_prealloc_size(ip, allocfork,
  1693							offset, count, &ccur);
  1694			if (prealloc_blocks) {
  1695				xfs_extlen_t	align;
  1696				xfs_off_t	end_offset;
  1697				xfs_fileoff_t	p_end_fsb;
  1698	
  1699				end_offset = XFS_ALLOC_ALIGN(mp, offset + count - 1);
  1700				p_end_fsb = XFS_B_TO_FSBT(mp, end_offset) +
  1701						prealloc_blocks;
  1702	
  1703				align = xfs_eof_alignment(ip);
  1704				if (align)
  1705					p_end_fsb = roundup_64(p_end_fsb, align);
  1706	
  1707				p_end_fsb = min(p_end_fsb,
  1708					XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes));
  1709				ASSERT(p_end_fsb > offset_fsb);
  1710				prealloc_blocks = p_end_fsb - end_fsb;
  1711			}
  1712		}
  1713	
  1714		/*
  1715		 * Flag newly allocated delalloc blocks with IOMAP_F_NEW so we punch
  1716		 * them out if the write happens to fail.
  1717		 */
  1718		iomap_flags |= IOMAP_F_NEW;
  1719		if (allocfork == XFS_COW_FORK) {
  1720			error = xfs_bmapi_reserve_delalloc(ip, allocfork, offset_fsb,
  1721					end_fsb - offset_fsb, prealloc_blocks, &cmap,
  1722					&ccur, cow_eof);
  1723			if (error)
  1724				goto out_unlock;
  1725	
  1726			trace_xfs_iomap_alloc(ip, offset, count, allocfork, &cmap);
  1727			goto found_cow;
  1728		}
  1729	
  1730		error = xfs_bmapi_reserve_delalloc(ip, allocfork, offset_fsb,
  1731				end_fsb - offset_fsb, prealloc_blocks, &imap, &icur,
  1732				eof);
  1733		if (error)
  1734			goto out_unlock;
  1735	
  1736		trace_xfs_iomap_alloc(ip, offset, count, allocfork, &imap);
  1737	found_imap:
  1738		seq = xfs_iomap_inode_sequence(ip, iomap_flags);
  1739		xfs_iunlock(ip, lockmode);
  1740		return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, iomap_flags, seq);
  1741	
  1742	convert_delay:
  1743		xfs_iunlock(ip, lockmode);
  1744		truncate_pagecache(inode, offset);
  1745		error = xfs_bmapi_convert_delalloc(ip, XFS_DATA_FORK, offset,
  1746						   iomap, NULL);
  1747		if (error)
  1748			return error;
  1749	
  1750		trace_xfs_iomap_alloc(ip, offset, count, XFS_DATA_FORK, &imap);
  1751		return 0;
  1752	
  1753	found_cow:
  1754		if (imap.br_startoff <= offset_fsb) {
  1755			error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0,
  1756					xfs_iomap_inode_sequence(ip, 0));
  1757			if (error)
  1758				goto out_unlock;
  1759		} else {
  1760			xfs_trim_extent(&cmap, offset_fsb,
  1761					imap.br_startoff - offset_fsb);
  1762		}
  1763	
  1764		iomap_flags |= IOMAP_F_SHARED;
  1765		seq = xfs_iomap_inode_sequence(ip, iomap_flags);
  1766		xfs_iunlock(ip, lockmode);
  1767		return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, iomap_flags, seq);
  1768	
  1769	out_unlock:
  1770		xfs_iunlock(ip, lockmode);
  1771		return error;
  1772	}
  1773	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 5/7] xfs: fill dirty folios on zero range of unwritten mappings
  2025-06-06  2:02   ` kernel test robot
@ 2025-06-06 15:20     ` Brian Foster
  0 siblings, 0 replies; 40+ messages in thread
From: Brian Foster @ 2025-06-06 15:20 UTC (permalink / raw)
  To: kernel test robot; +Cc: linux-fsdevel, oe-kbuild-all, linux-xfs, linux-mm

On Fri, Jun 06, 2025 at 10:02:34AM +0800, kernel test robot wrote:
> Hi Brian,
> 
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on brauner-vfs/vfs.all]
> [also build test ERROR on akpm-mm/mm-everything linus/master next-20250605]
> [cannot apply to xfs-linux/for-next v6.15]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Brian-Foster/iomap-move-pos-len-BUG_ON-to-after-folio-lookup/20250606-013227
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.all
> patch link:    https://lore.kernel.org/r/20250605173357.579720-6-bfoster%40redhat.com
> patch subject: [PATCH 5/7] xfs: fill dirty folios on zero range of unwritten mappings
> config: i386-buildonly-randconfig-003-20250606 (https://download.01.org/0day-ci/archive/20250606/202506060903.vM8I4O0S-lkp@intel.com/config)
> compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250606/202506060903.vM8I4O0S-lkp@intel.com/reproduce)
> 

The series is currently based on latest master. For some reason when
applied to vfs.all, the iter variable hunk of this patch applies to the
wrong function.

I'm not 100% sure what the conflict is, but if I had to guess after a
quick look at both branches, master looks like it has XFS atomic writes
bits pulled in that touch this area.

Anyways I don't know if the robots expect a different base here given
the combination of vfs (iomap), xfs, and mm, but if nothing else I'll
see if this resolves by the time a v2 comes around...

Brian



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/7] filemap: add helper to look up dirty folios in a range
  2025-06-05 17:33 ` [PATCH 2/7] filemap: add helper to look up dirty folios in a range Brian Foster
@ 2025-06-09 15:48   ` Darrick J. Wong
  2025-06-10  4:21     ` Christoph Hellwig
  2025-06-10 12:17     ` Brian Foster
  2025-06-10  4:22   ` Christoph Hellwig
  1 sibling, 2 replies; 40+ messages in thread
From: Darrick J. Wong @ 2025-06-09 15:48 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Thu, Jun 05, 2025 at 01:33:52PM -0400, Brian Foster wrote:
> Add a new filemap_get_folios_dirty() helper to look up existing dirty
> folios in a range and add them to a folio_batch. This is to support
> optimization of certain iomap operations that only care about dirty
> folios in a target range. For example, zero range only zeroes the subset
> of dirty pages over unwritten mappings, seek hole/data may use similar
> logic in the future, etc.
> 
> Note that the helper is intended for use under internal fs locks.
> Therefore it trylocks folios in order to filter out clean folios.
> This loosely follows the logic from filemap_range_has_writeback().
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

You might want to cc willy directly on this one... 
> ---
>  include/linux/pagemap.h |  2 ++
>  mm/filemap.c            | 42 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 44 insertions(+)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index e63fbfbd5b0f..fb83ddf26621 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -941,6 +941,8 @@ unsigned filemap_get_folios_contig(struct address_space *mapping,
>  		pgoff_t *start, pgoff_t end, struct folio_batch *fbatch);
>  unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t *start,
>  		pgoff_t end, xa_mark_t tag, struct folio_batch *fbatch);
> +unsigned filemap_get_folios_dirty(struct address_space *mapping,
> +		pgoff_t *start, pgoff_t end, struct folio_batch *fbatch);
>  
>  /*
>   * Returns locked page at given index in given cache, creating it if needed.
> diff --git a/mm/filemap.c b/mm/filemap.c
> index bada249b9fb7..d28e984cdfd4 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2334,6 +2334,48 @@ unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t *start,
>  }
>  EXPORT_SYMBOL(filemap_get_folios_tag);
>  
> +unsigned filemap_get_folios_dirty(struct address_space *mapping, pgoff_t *start,
> +			pgoff_t end, struct folio_batch *fbatch)

This ought to have a comment explaining what the function does.
It identifies every folio starting at @*start and ending before @end
that is dirty and tries to assign them to @fbatch, right?

The code looks reasonable to me; hopefully there aren't some subtleties
that I'm missing here :P

> +{
> +	XA_STATE(xas, &mapping->i_pages, *start);
> +	struct folio *folio;
> +
> +	rcu_read_lock();
> +	while ((folio = find_get_entry(&xas, end, XA_PRESENT)) != NULL) {
> +		if (xa_is_value(folio))
> +			continue;
> +		if (folio_trylock(folio)) {
> +			bool clean = !folio_test_dirty(folio) &&
> +				     !folio_test_writeback(folio);
> +			folio_unlock(folio);
> +			if (clean) {
> +				folio_put(folio);
> +				continue;
> +			}
> +		}
> +		if (!folio_batch_add(fbatch, folio)) {
> +			unsigned long nr = folio_nr_pages(folio);
> +			*start = folio->index + nr;
> +			goto out;
> +		}
> +	}
> +	/*
> +	 * We come here when there is no page beyond @end. We take care to not

...no folio beyond @end?

--D

> +	 * overflow the index @start as it confuses some of the callers. This
> +	 * breaks the iteration when there is a page at index -1 but that is
> +	 * already broke anyway.
> +	 */
> +	if (end == (pgoff_t)-1)
> +		*start = (pgoff_t)-1;
> +	else
> +		*start = end + 1;
> +out:
> +	rcu_read_unlock();
> +
> +	return folio_batch_count(fbatch);
> +}
> +EXPORT_SYMBOL(filemap_get_folios_dirty);
> +
>  /*
>   * CD/DVDs are error prone. When a medium error occurs, the driver may fail
>   * a _large_ part of the i/o request. Imagine the worst scenario:
> -- 
> 2.49.0
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/7] iomap: optional zero range dirty folio processing
  2025-06-05 17:33 ` [PATCH 3/7] iomap: optional zero range dirty folio processing Brian Foster
@ 2025-06-09 16:04   ` Darrick J. Wong
  2025-06-10  4:27     ` Christoph Hellwig
  2025-06-10 12:21     ` Brian Foster
  2025-06-10  4:27   ` Christoph Hellwig
  1 sibling, 2 replies; 40+ messages in thread
From: Darrick J. Wong @ 2025-06-09 16:04 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Thu, Jun 05, 2025 at 01:33:53PM -0400, Brian Foster wrote:
> The only way zero range can currently process unwritten mappings
> with dirty pagecache is to check whether the range is dirty before
> mapping lookup and then flush when at least one underlying mapping
> is unwritten. This ordering is required to prevent iomap lookup from
> racing with folio writeback and reclaim.
> 
> Since zero range can skip ranges of unwritten mappings that are
> clean in cache, this operation can be improved by allowing the
> filesystem to provide a set of dirty folios that require zeroing. In
> turn, rather than flush or iterate file offsets, zero range can
> iterate on folios in the batch and advance over clean or uncached
> ranges in between.
> 
> Add a folio_batch in struct iomap and provide a helper for fs' to
> populate the batch at lookup time. Update the folio lookup path to
> return the next folio in the batch, if provided, and advance the
> iter if the folio starts beyond the current offset.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/iomap/buffered-io.c | 73 +++++++++++++++++++++++++++++++++++++++---
>  fs/iomap/iter.c        |  6 ++++
>  include/linux/iomap.h  |  4 +++
>  3 files changed, 78 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 16499655e7b0..cf2f4f869920 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -750,6 +750,16 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter, size_t len)
>  	if (!mapping_large_folio_support(iter->inode->i_mapping))
>  		len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
>  
> +	if (iter->fbatch) {
> +		struct folio *folio = folio_batch_next(iter->fbatch);
> +
> +		if (folio) {
> +			folio_get(folio);
> +			folio_lock(folio);

Hrm.  So each folio that is added to the batch isn't locked, nor does
the batch (or iomap) hold a refcount on the folio until we get here.  Do
we have to re-check that folio->{mapping,index} match what iomap is
trying to process?  Or can we assume that nobody has removed the folio
from the mapping?

I'm wondering because __filemap_get_folio/filemap_get_entry seem to do
all that for us.  I think the folio_pos check below might cover some of
that revalidation?

> +		}
> +		return folio;
> +	}
> +
>  	if (folio_ops && folio_ops->get_folio)
>  		return folio_ops->get_folio(iter, pos, len);
>  	else
> @@ -811,6 +821,8 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
>  	int status = 0;
>  
>  	len = min_not_zero(len, *plen);
> +	*foliop = NULL;
> +	*plen = 0;
>  
>  	if (fatal_signal_pending(current))
>  		return -EINTR;
> @@ -819,6 +831,12 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
>  	if (IS_ERR(folio))
>  		return PTR_ERR(folio);
>  
> +	/* no folio means we're done with a batch */

...ran out of folios but *plen is nonzero, i.e. we still have range to
cover?

> +	if (!folio) {
> +		WARN_ON_ONCE(!iter->fbatch);
> +		return 0;
> +	}
> +
>  	/*
>  	 * Now we have a locked folio, before we do anything with it we need to
>  	 * check that the iomap we have cached is not stale. The inode extent
> @@ -839,6 +857,20 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
>  		}
>  	}
>  
> +	/*
> +	 * folios in a batch may not be contiguous. If we've skipped forward,
> +	 * advance the iter to the pos of the current folio. If the folio starts
> +	 * beyond the end of the mapping, it may have been trimmed since the
> +	 * lookup for whatever reason. Return a NULL folio to terminate the op.
> +	 */
> +	if (folio_pos(folio) > iter->pos) {
> +		len = min_t(u64, folio_pos(folio) - iter->pos,
> +				 iomap_length(iter));
> +		status = iomap_iter_advance(iter, &len);
> +		if (status || !len)
> +			goto out_unlock;
> +	}
> +
>  	pos = iomap_trim_folio_range(iter, folio, poffset, &len);
>  	BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
>  	if (srcmap != &iter->iomap)
> @@ -1380,6 +1412,12 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		if (iter->iomap.flags & IOMAP_F_STALE)
>  			break;
>  
> +		/* a NULL folio means we're done with a folio batch */
> +		if (!folio) {
> +			status = iomap_iter_advance_full(iter);
> +			break;
> +		}
> +
>  		/* warn about zeroing folios beyond eof that won't write back */
>  		WARN_ON_ONCE(folio_pos(folio) > iter->inode->i_size);
>  
> @@ -1401,6 +1439,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  	return status;
>  }
>  
> +loff_t
> +iomap_fill_dirty_folios(
> +	struct iomap_iter	*iter,
> +	loff_t			offset,
> +	loff_t			length)
> +{
> +	struct address_space	*mapping = iter->inode->i_mapping;
> +	pgoff_t			start = offset >> PAGE_SHIFT;
> +	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;
> +
> +	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
> +	if (!iter->fbatch)
> +		return offset + length;
> +	folio_batch_init(iter->fbatch);
> +
> +	filemap_get_folios_dirty(mapping, &start, end, iter->fbatch);
> +	return (start << PAGE_SHIFT);
> +}
> +EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios);

Not used anywhere ... ?

Oh, it's in the next patch.

> +
>  int
>  iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>  		const struct iomap_ops *ops, void *private)
> @@ -1429,7 +1487,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>  	 * flushing on partial eof zeroing, special case it to zero the
>  	 * unaligned start portion if already dirty in pagecache.
>  	 */
> -	if (off &&
> +	if (!iter.fbatch && off &&
>  	    filemap_range_needs_writeback(mapping, pos, pos + plen - 1)) {
>  		iter.len = plen;
>  		while ((ret = iomap_iter(&iter, ops)) > 0)
> @@ -1445,13 +1503,18 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>  	 * if dirty and the fs returns a mapping that might convert on
>  	 * writeback.
>  	 */
> -	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
> -					iter.pos, iter.pos + iter.len - 1);
> +	range_dirty = filemap_range_needs_writeback(mapping, iter.pos,
> +					iter.pos + iter.len - 1);
>  	while ((ret = iomap_iter(&iter, ops)) > 0) {
>  		const struct iomap *srcmap = iomap_iter_srcmap(&iter);
>  
> -		if (srcmap->type == IOMAP_HOLE ||
> -		    srcmap->type == IOMAP_UNWRITTEN) {
> +		if (WARN_ON_ONCE(iter.fbatch &&
> +				 srcmap->type != IOMAP_UNWRITTEN))

I wonder, are you planning to expand the folio batching to other
buffered-io.c operations?  Such that the iter.fbatch checks might some
day go away?

--D

> +			return -EIO;
> +
> +		if (!iter.fbatch &&
> +		    (srcmap->type == IOMAP_HOLE ||
> +		     srcmap->type == IOMAP_UNWRITTEN)) {
>  			s64 status;
>  
>  			if (range_dirty) {
> diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
> index 6ffc6a7b9ba5..89bd5951a6fd 100644
> --- a/fs/iomap/iter.c
> +++ b/fs/iomap/iter.c
> @@ -9,6 +9,12 @@
>  
>  static inline void iomap_iter_reset_iomap(struct iomap_iter *iter)
>  {
> +	if (iter->fbatch) {
> +		folio_batch_release(iter->fbatch);
> +		kfree(iter->fbatch);
> +		iter->fbatch = NULL;
> +	}
> +
>  	iter->status = 0;
>  	memset(&iter->iomap, 0, sizeof(iter->iomap));
>  	memset(&iter->srcmap, 0, sizeof(iter->srcmap));
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 522644d62f30..0b9b460b2873 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -9,6 +9,7 @@
>  #include <linux/types.h>
>  #include <linux/mm_types.h>
>  #include <linux/blkdev.h>
> +#include <linux/pagevec.h>
>  
>  struct address_space;
>  struct fiemap_extent_info;
> @@ -239,6 +240,7 @@ struct iomap_iter {
>  	unsigned flags;
>  	struct iomap iomap;
>  	struct iomap srcmap;
> +	struct folio_batch *fbatch;
>  	void *private;
>  };
>  
> @@ -345,6 +347,8 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
>  bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
>  int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
>  		const struct iomap_ops *ops);
> +loff_t iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t offset,
> +		loff_t length);
>  int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
>  		bool *did_zero, const struct iomap_ops *ops, void *private);
>  int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> -- 
> 2.49.0
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/7] xfs: always trim mapping to requested range for zero range
  2025-06-05 17:33 ` [PATCH 4/7] xfs: always trim mapping to requested range for zero range Brian Foster
@ 2025-06-09 16:07   ` Darrick J. Wong
  0 siblings, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2025-06-09 16:07 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Thu, Jun 05, 2025 at 01:33:54PM -0400, Brian Foster wrote:
> Refactor and tweak the IOMAP_ZERO logic in preparation to support
> filling the folio batch for unwritten mappings. Drop the superfluous
> imap offset check since the hole case has already been filtered out.
> Split the the delalloc case handling into a sub-branch, and always
> trim the imap to the requested offset/count so it can be more easily
> used to bound the range to lookup in pagecache.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Yeah, makes sense to me.
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_iomap.c | 17 ++++++++---------
>  1 file changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index ff05e6b1b0bb..b5cf5bc6308d 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1756,21 +1756,20 @@ xfs_buffered_write_iomap_begin(
>  	}
>  
>  	/*
> -	 * For zeroing, trim a delalloc extent that extends beyond the EOF
> -	 * block.  If it starts beyond the EOF block, convert it to an
> +	 * For zeroing, trim extents that extend beyond the EOF block. If a
> +	 * delalloc extent starts beyond the EOF block, convert it to an
>  	 * unwritten extent.
>  	 */
> -	if ((flags & IOMAP_ZERO) && imap.br_startoff <= offset_fsb &&
> -	    isnullstartblock(imap.br_startblock)) {
> +	if (flags & IOMAP_ZERO) {
>  		xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
>  
> -		if (offset_fsb >= eof_fsb)
> +		if (isnullstartblock(imap.br_startblock) &&
> +		    offset_fsb >= eof_fsb)
>  			goto convert_delay;
> -		if (end_fsb > eof_fsb) {
> +		if (offset_fsb < eof_fsb && end_fsb > eof_fsb)
>  			end_fsb = eof_fsb;
> -			xfs_trim_extent(&imap, offset_fsb,
> -					end_fsb - offset_fsb);
> -		}
> +
> +		xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
>  	}
>  
>  	/*
> -- 
> 2.49.0
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 5/7] xfs: fill dirty folios on zero range of unwritten mappings
  2025-06-05 17:33 ` [PATCH 5/7] xfs: fill dirty folios on zero range of unwritten mappings Brian Foster
  2025-06-06  2:02   ` kernel test robot
@ 2025-06-09 16:12   ` Darrick J. Wong
  2025-06-10  4:31     ` Christoph Hellwig
  2025-06-10 12:24     ` Brian Foster
  1 sibling, 2 replies; 40+ messages in thread
From: Darrick J. Wong @ 2025-06-09 16:12 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Thu, Jun 05, 2025 at 01:33:55PM -0400, Brian Foster wrote:
> Use the iomap folio batch mechanism to select folios to zero on zero
> range of unwritten mappings. Trim the resulting mapping if the batch
> is filled (unlikely for current use cases) to distinguish between a
> range to skip and one that requires another iteration due to a full
> batch.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_iomap.c | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index b5cf5bc6308d..63054f7ead0e 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1691,6 +1691,8 @@ xfs_buffered_write_iomap_begin(
>  	struct iomap		*iomap,
>  	struct iomap		*srcmap)
>  {
> +	struct iomap_iter	*iter = container_of(iomap, struct iomap_iter,
> +						     iomap);

/me has been wondering more and more if we should just pass the iter
directly to iomap_begin rather than make them play these container_of
tricks... OTOH I think the whole point of this:

	ret = ops->iomap_begin(iter->inode, iter->pos, iter->len, iter->flags,
			       &iter->iomap, &iter->srcmap);

is to "avoid" allowing the iomap users to mess with the internals of the
iomap iter...

>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
>  	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> @@ -1762,6 +1764,7 @@ xfs_buffered_write_iomap_begin(
>  	 */
>  	if (flags & IOMAP_ZERO) {
>  		xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
> +		u64 end;
>  
>  		if (isnullstartblock(imap.br_startblock) &&
>  		    offset_fsb >= eof_fsb)
> @@ -1769,6 +1772,26 @@ xfs_buffered_write_iomap_begin(
>  		if (offset_fsb < eof_fsb && end_fsb > eof_fsb)
>  			end_fsb = eof_fsb;
>  
> +		/*
> +		 * Look up dirty folios for unwritten mappings within EOF.
> +		 * Providing this bypasses the flush iomap uses to trigger
> +		 * extent conversion when unwritten mappings have dirty
> +		 * pagecache in need of zeroing.
> +		 *
> +		 * Trim the mapping to the end pos of the lookup, which in turn
> +		 * was trimmed to the end of the batch if it became full before
> +		 * the end of the mapping.
> +		 */
> +		if (imap.br_state == XFS_EXT_UNWRITTEN &&
> +		    offset_fsb < eof_fsb) {
> +			loff_t len = min(count,
> +					 XFS_FSB_TO_B(mp, imap.br_blockcount));
> +
> +			end = iomap_fill_dirty_folios(iter, offset, len);

...though I wonder, does this need to happen in
xfs_buffered_write_iomap_begin?  Is it required to hold the ILOCK while
we go look for folios in the mapping?  Or could this become a part of
iomap_write_begin?

--D

> +			end_fsb = min_t(xfs_fileoff_t, end_fsb,
> +					XFS_B_TO_FSB(mp, end));
> +		}
> +
>  		xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
>  	}
>  
> -- 
> 2.49.0
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/7] iomap: move pos+len BUG_ON() to after folio lookup
  2025-06-05 17:33 ` [PATCH 1/7] iomap: move pos+len BUG_ON() to after folio lookup Brian Foster
@ 2025-06-09 16:16   ` Darrick J. Wong
  2025-06-10  4:20     ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2025-06-09 16:16 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Thu, Jun 05, 2025 at 01:33:51PM -0400, Brian Foster wrote:
> The bug checks at the top of iomap_write_begin() assume the pos/len
> reflect exactly the next range to process. This may no longer be the
> case once the get folio path is able to process a folio batch from
> the filesystem. Move the check a bit further down after the folio
> lookup and range trim to verify everything lines up with the current
> iomap.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/iomap/buffered-io.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 3729391a18f3..16499655e7b0 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -805,15 +805,12 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
>  {
>  	const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> -	loff_t pos = iter->pos;
> +	loff_t pos;
>  	u64 len = min_t(u64, SIZE_MAX, iomap_length(iter));
>  	struct folio *folio;
>  	int status = 0;
>  
>  	len = min_not_zero(len, *plen);
> -	BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
> -	if (srcmap != &iter->iomap)
> -		BUG_ON(pos + len > srcmap->offset + srcmap->length);

Hmm.  Do we even /need/ these checks?

len is already basically just min(SIZE_MAX, iter->len,
iomap->offset + iomap->length, srcmap->offset + srcmap->length)

So by definition they should never trigger, right?

--D

>  
>  	if (fatal_signal_pending(current))
>  		return -EINTR;
> @@ -843,6 +840,9 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
>  	}
>  
>  	pos = iomap_trim_folio_range(iter, folio, poffset, &len);
> +	BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
> +	if (srcmap != &iter->iomap)
> +		BUG_ON(pos + len > srcmap->offset + srcmap->length);
>  
>  	if (srcmap->type == IOMAP_INLINE)
>  		status = iomap_write_begin_inline(iter, folio);
> -- 
> 2.49.0
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/7] iomap: move pos+len BUG_ON() to after folio lookup
  2025-06-09 16:16   ` Darrick J. Wong
@ 2025-06-10  4:20     ` Christoph Hellwig
  2025-06-10 12:16       ` Brian Foster
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-10  4:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-fsdevel, linux-xfs, linux-mm

On Mon, Jun 09, 2025 at 09:16:49AM -0700, Darrick J. Wong wrote:
> Hmm.  Do we even /need/ these checks?
> 
> len is already basically just min(SIZE_MAX, iter->len,
> iomap->offset + iomap->length, srcmap->offset + srcmap->length)
> 
> So by definition they should never trigger, right?

Yes, now that it is after the range trim it feels pretty pointless.
So count me in for just removing it.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/7] filemap: add helper to look up dirty folios in a range
  2025-06-09 15:48   ` Darrick J. Wong
@ 2025-06-10  4:21     ` Christoph Hellwig
  2025-06-10 12:17     ` Brian Foster
  1 sibling, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-10  4:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-fsdevel, linux-xfs, linux-mm

On Mon, Jun 09, 2025 at 08:48:02AM -0700, Darrick J. Wong wrote:
> You might want to cc willy directly on this one... 

And keep it first in the series as the non-subsystem dependency.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/7] filemap: add helper to look up dirty folios in a range
  2025-06-05 17:33 ` [PATCH 2/7] filemap: add helper to look up dirty folios in a range Brian Foster
  2025-06-09 15:48   ` Darrick J. Wong
@ 2025-06-10  4:22   ` Christoph Hellwig
  1 sibling, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-10  4:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Thu, Jun 05, 2025 at 01:33:52PM -0400, Brian Foster wrote:
> +EXPORT_SYMBOL(filemap_get_folios_dirty);

Right now this is only used in iomap, so it doesn't need an export
at all.  But if it had one it should be EXPORT_SYMBOL_GPL.

Otherwise this looks fine, but as Darrick noted a kerneldoc comment
would be nice.
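
For instance, a minimal kerneldoc along the lines Darrick described
(just a sketch, wording to taste):

/**
 * filemap_get_folios_dirty - Get a batch of dirty folios
 * @mapping:	The address_space to search
 * @start:	The starting folio index
 * @end:	The final folio index (inclusive)
 * @fbatch:	The batch to fill
 *
 * Find folios that start before @end and appear dirty or under
 * writeback, take a reference on each and add them to @fbatch.  Folios
 * that cannot be trylocked are conservatively included.  @start is
 * updated so the search can be continued after a full batch.
 *
 * Returns: the number of folios in @fbatch.
 */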


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/7] iomap: optional zero range dirty folio processing
  2025-06-09 16:04   ` Darrick J. Wong
@ 2025-06-10  4:27     ` Christoph Hellwig
  2025-06-10 12:21       ` Brian Foster
  2025-06-10 12:21     ` Brian Foster
  1 sibling, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-10  4:27 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-fsdevel, linux-xfs, linux-mm

On Mon, Jun 09, 2025 at 09:04:20AM -0700, Darrick J. Wong wrote:
> > +	if (iter->fbatch) {
> > +		struct folio *folio = folio_batch_next(iter->fbatch);
> > +
> > +		if (folio) {
> > +			folio_get(folio);
> > +			folio_lock(folio);
> 
> Hrm.  So each folio that is added to the batch isn't locked, nor does
> the batch (or iomap) hold a refcount on the folio until we get here.

find_get_entry references a folio, and filemap_get_folios_dirty
preserves that reference.  It will be released in folio_batch_release.
So this just adds an extra reference for local operation so that the
rest of iomap doesn't need to know about that magic.

> Do
> we have to re-check that folio->{mapping,index} match what iomap is
> trying to process?  Or can we assume that nobody has removed the folio
> from the mapping?

That's a good point, though: without the folio locked it could get
truncated, so I think we'll have to redo the truncate check here.
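
Something like the usual recheck after locking, I think (only a sketch;
what exactly to do with a truncated entry needs thought):

	if (iter->fbatch) {
		struct folio *folio = folio_batch_next(iter->fbatch);

		if (folio) {
			folio_get(folio);
			folio_lock(folio);
			/* now that it's locked, recheck it wasn't truncated */
			if (unlikely(folio->mapping != iter->inode->i_mapping)) {
				folio_unlock(folio);
				folio_put(folio);
				folio = NULL;	/* or move on to the next entry */
			}
		}
		return folio;
	}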


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/7] iomap: optional zero range dirty folio processing
  2025-06-05 17:33 ` [PATCH 3/7] iomap: optional zero range dirty folio processing Brian Foster
  2025-06-09 16:04   ` Darrick J. Wong
@ 2025-06-10  4:27   ` Christoph Hellwig
  1 sibling, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-10  4:27 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm

> +	 * folios in a batch may not be contiguous. If we've skipped forward,

Start the sentence with a capitalized word?


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 5/7] xfs: fill dirty folios on zero range of unwritten mappings
  2025-06-09 16:12   ` Darrick J. Wong
@ 2025-06-10  4:31     ` Christoph Hellwig
  2025-06-10 12:24     ` Brian Foster
  1 sibling, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-10  4:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-fsdevel, linux-xfs, linux-mm

On Mon, Jun 09, 2025 at 09:12:19AM -0700, Darrick J. Wong wrote:
> > +	struct iomap_iter	*iter = container_of(iomap, struct iomap_iter,
> > +						     iomap);
> 
> /me has been wondering more and more if we should just pass the iter
> directly to iomap_begin rather than make them play these container_of
> tricks... OTOH I think the whole point of this:
> 
> 	ret = ops->iomap_begin(iter->inode, iter->pos, iter->len, iter->flags,
> 			       &iter->iomap, &iter->srcmap);
> 
> is to "avoid" allowing the iomap users to mess with the internals of the
> iomap iter...

No, the reason is that the ->iomap_begin interface predates
struct iomap_iter.  I delayed changing the prototype because I thought
I'd get to the second step of willy's iterization idea and kill these
methods in favor of a real iterator, but that just doesn't keep
happening.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 6/7] iomap: remove old partial eof zeroing optimization
  2025-06-05 17:33 ` [PATCH 6/7] iomap: remove old partial eof zeroing optimization Brian Foster
@ 2025-06-10  4:32   ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-10  4:32 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC 7/7] xfs: error tag to force zeroing on debug kernels
  2025-06-05 17:33 ` [PATCH RFC 7/7] xfs: error tag to force zeroing on debug kernels Brian Foster
@ 2025-06-10  4:33   ` Christoph Hellwig
  2025-06-10 12:26     ` Brian Foster
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-10  4:33 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Thu, Jun 05, 2025 at 01:33:57PM -0400, Brian Foster wrote:
> iomap_zero_range() has to cover various corner cases that are
> difficult to test on production kernels because it is used in fairly
> limited use cases. For example, it is currently only used by XFS and
> mostly only in partial block zeroing cases.
> 
> While it's possible to test most of these functional cases, we can
> provide more robust test coverage by co-opting fallocate zero range
> to invoke zeroing of the entire range instead of the more efficient
> block punch/allocate sequence. Add an errortag to occasionally
> invoke forced zeroing.

I like this, having an easy way to improve code coverage using the
existing fallocate and errtag interfaces is always a good thing.

Can I assume you plan to add a testcase using the errtag to xfstests?


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/7] iomap: move pos+len BUG_ON() to after folio lookup
  2025-06-10  4:20     ` Christoph Hellwig
@ 2025-06-10 12:16       ` Brian Foster
  0 siblings, 0 replies; 40+ messages in thread
From: Brian Foster @ 2025-06-10 12:16 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-fsdevel, linux-xfs, linux-mm

On Mon, Jun 09, 2025 at 09:20:51PM -0700, Christoph Hellwig wrote:
> On Mon, Jun 09, 2025 at 09:16:49AM -0700, Darrick J. Wong wrote:
> > Hmm.  Do we even /need/ these checks?
> > 
> > len is already basically just min(SIZE_MAX, iter->len,
> > iomap->offset + iomap->length, srcmap->offset + srcmap->length)
> > 
> > So by definition they should never trigger, right?
> 
> Yes, now that it is after the range trim it feels pretty pointless.
> So count me in for just removing it.
> 
> 

Fair points.. I'll update this patch to just drop it.

Brian


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/7] filemap: add helper to look up dirty folios in a range
  2025-06-09 15:48   ` Darrick J. Wong
  2025-06-10  4:21     ` Christoph Hellwig
@ 2025-06-10 12:17     ` Brian Foster
  1 sibling, 0 replies; 40+ messages in thread
From: Brian Foster @ 2025-06-10 12:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Mon, Jun 09, 2025 at 08:48:02AM -0700, Darrick J. Wong wrote:
> On Thu, Jun 05, 2025 at 01:33:52PM -0400, Brian Foster wrote:
> > Add a new filemap_get_folios_dirty() helper to look up existing dirty
> > folios in a range and add them to a folio_batch. This is to support
> > optimization of certain iomap operations that only care about dirty
> > folios in a target range. For example, zero range only zeroes the subset
> > of dirty pages over unwritten mappings, seek hole/data may use similar
> > logic in the future, etc.
> > 
> > Note that the helper is intended for use under internal fs locks.
> > Therefore it trylocks folios in order to filter out clean folios.
> > This loosely follows the logic from filemap_range_has_writeback().
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> 
> You might want to cc willy directly on this one... 

Er yeah, I'll do that for v2.

> > ---
> >  include/linux/pagemap.h |  2 ++
> >  mm/filemap.c            | 42 +++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 44 insertions(+)
> > 
...
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index bada249b9fb7..d28e984cdfd4 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2334,6 +2334,48 @@ unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t *start,
> >  }
> >  EXPORT_SYMBOL(filemap_get_folios_tag);
> >  
> > +unsigned filemap_get_folios_dirty(struct address_space *mapping, pgoff_t *start,
> > +			pgoff_t end, struct folio_batch *fbatch)
> 
> This ought to have a comment explaining what the function does.
> It identifies every folio starting at @*start and ending before @end
> that is dirty and tries to assign them to @fbatch, right?
> 

Yep, or at least every folio that starts before end. I'll add a comment
and incorporate all the followup feedback. Thanks.

Brian

> The code looks reasonable to me; hopefully there aren't some subtleties
> that I'm missing here :P
> 
> > +{
> > +	XA_STATE(xas, &mapping->i_pages, *start);
> > +	struct folio *folio;
> > +
> > +	rcu_read_lock();
> > +	while ((folio = find_get_entry(&xas, end, XA_PRESENT)) != NULL) {
> > +		if (xa_is_value(folio))
> > +			continue;
> > +		if (folio_trylock(folio)) {
> > +			bool clean = !folio_test_dirty(folio) &&
> > +				     !folio_test_writeback(folio);
> > +			folio_unlock(folio);
> > +			if (clean) {
> > +				folio_put(folio);
> > +				continue;
> > +			}
> > +		}
> > +		if (!folio_batch_add(fbatch, folio)) {
> > +			unsigned long nr = folio_nr_pages(folio);
> > +			*start = folio->index + nr;
> > +			goto out;
> > +		}
> > +	}
> > +	/*
> > +	 * We come here when there is no page beyond @end. We take care to not
> 
> ...no folio beyond @end?
> 
> --D
> 
> > +	 * overflow the index @start as it confuses some of the callers. This
> > +	 * breaks the iteration when there is a page at index -1 but that is
> > +	 * already broke anyway.
> > +	 */
> > +	if (end == (pgoff_t)-1)
> > +		*start = (pgoff_t)-1;
> > +	else
> > +		*start = end + 1;
> > +out:
> > +	rcu_read_unlock();
> > +
> > +	return folio_batch_count(fbatch);
> > +}
> > +EXPORT_SYMBOL(filemap_get_folios_dirty);
> > +
> >  /*
> >   * CD/DVDs are error prone. When a medium error occurs, the driver may fail
> >   * a _large_ part of the i/o request. Imagine the worst scenario:
> > -- 
> > 2.49.0
> > 
> > 
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/7] iomap: optional zero range dirty folio processing
  2025-06-09 16:04   ` Darrick J. Wong
  2025-06-10  4:27     ` Christoph Hellwig
@ 2025-06-10 12:21     ` Brian Foster
  2025-06-10 13:29       ` Christoph Hellwig
  2025-06-10 14:55       ` Darrick J. Wong
  1 sibling, 2 replies; 40+ messages in thread
From: Brian Foster @ 2025-06-10 12:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Mon, Jun 09, 2025 at 09:04:20AM -0700, Darrick J. Wong wrote:
> On Thu, Jun 05, 2025 at 01:33:53PM -0400, Brian Foster wrote:
> > The only way zero range can currently process unwritten mappings
> > with dirty pagecache is to check whether the range is dirty before
> > mapping lookup and then flush when at least one underlying mapping
> > is unwritten. This ordering is required to prevent iomap lookup from
> > racing with folio writeback and reclaim.
> > 
> > Since zero range can skip ranges of unwritten mappings that are
> > clean in cache, this operation can be improved by allowing the
> > filesystem to provide a set of dirty folios that require zeroing. In
> > turn, rather than flush or iterate file offsets, zero range can
> > iterate on folios in the batch and advance over clean or uncached
> > ranges in between.
> > 
> > Add a folio_batch in struct iomap and provide a helper for fs' to
> > populate the batch at lookup time. Update the folio lookup path to
> > return the next folio in the batch, if provided, and advance the
> > iter if the folio starts beyond the current offset.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/iomap/buffered-io.c | 73 +++++++++++++++++++++++++++++++++++++++---
> >  fs/iomap/iter.c        |  6 ++++
> >  include/linux/iomap.h  |  4 +++
> >  3 files changed, 78 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 16499655e7b0..cf2f4f869920 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -750,6 +750,16 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter, size_t len)
> >  	if (!mapping_large_folio_support(iter->inode->i_mapping))
> >  		len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
> >  
> > +	if (iter->fbatch) {
> > +		struct folio *folio = folio_batch_next(iter->fbatch);
> > +
> > +		if (folio) {
> > +			folio_get(folio);
> > +			folio_lock(folio);
> 
> Hrm.  So each folio that is added to the batch isn't locked, nor does
> the batch (or iomap) hold a refcount on the folio until we get here.  Do
> we have to re-check that folio->{mapping,index} match what iomap is
> trying to process?  Or can we assume that nobody has removed the folio
> from the mapping?
> 

The filemap helper grabs a reference to the folio but doesn't
necessarily lock it. The ref is effectively transferred to the batch
there and the _get() here creates the iomap reference (i.e. that is
analogous to the traditional iomap get folio path). The batch is
ultimately released via folio_batch_release() and the iomap refs dropped
in the same way regardless of whether iomap grabbed it itself or it was
part of a batch.
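
FWIW, my mental model of the reference flow with this series (sketched
as comments, not literal code):

	/* filemap_get_folios_dirty(): find_get_entry() takes a reference
	 * that is handed over to the folio_batch */
	/* __iomap_get_folio(): folio_get() takes iomap's own reference and
	 * folio_lock() locks the folio, as in the normal lookup path */
	/* __iomap_put_folio(): drops iomap's reference as usual */
	/* iomap_iter_reset_iomap(): folio_batch_release() drops the batch
	 * references and the batch itself is freed */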

> I'm wondering because __filemap_get_folio/filemap_get_entry seem to do
> all that for us.  I think the folio_pos check below might cover some of
> that revalidation?
> 

I'm not totally sure the folio revalidation is necessarily required
here.. If it is, I'd also need to think about whether it's ok to skip
such folios or whether the approach needs revisiting. I'll take a closer
look and also try to document this better and get some feedback from
people who know this code better in the next go around..

> > +		}
> > +		return folio;
> > +	}
> > +
> >  	if (folio_ops && folio_ops->get_folio)
> >  		return folio_ops->get_folio(iter, pos, len);
> >  	else
...
> > @@ -819,6 +831,12 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
> >  	if (IS_ERR(folio))
> >  		return PTR_ERR(folio);
> >  
> > +	/* no folio means we're done with a batch */
> 
> ...ran out of folios but *plen is nonzero, i.e. we still have range to
> cover?
> 

Yes I suppose that is implied by being in this path.. will fix.

> > +	if (!folio) {
> > +		WARN_ON_ONCE(!iter->fbatch);
> > +		return 0;
> > +	}
> > +
> >  	/*
> >  	 * Now we have a locked folio, before we do anything with it we need to
> >  	 * check that the iomap we have cached is not stale. The inode extent
...
> > +
> >  int
> >  iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> >  		const struct iomap_ops *ops, void *private)
...
> > @@ -1445,13 +1503,18 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> >  	 * if dirty and the fs returns a mapping that might convert on
> >  	 * writeback.
> >  	 */
> > -	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
> > -					iter.pos, iter.pos + iter.len - 1);
> > +	range_dirty = filemap_range_needs_writeback(mapping, iter.pos,
> > +					iter.pos + iter.len - 1);
> >  	while ((ret = iomap_iter(&iter, ops)) > 0) {
> >  		const struct iomap *srcmap = iomap_iter_srcmap(&iter);
> >  
> > -		if (srcmap->type == IOMAP_HOLE ||
> > -		    srcmap->type == IOMAP_UNWRITTEN) {
> > +		if (WARN_ON_ONCE(iter.fbatch &&
> > +				 srcmap->type != IOMAP_UNWRITTEN))
> 
> I wonder, are you planning to expand the folio batching to other
> buffered-io.c operations?  Such that the iter.fbatch checks might some
> day go away?
> 

Yes.. but I'm not totally sure wrt impact on the fbatch checks quite
yet. The next thing I wanted to look at is addressing the same unwritten
mapping vs. dirty folios issue in the seek data/hole path. It's been a
little while since I last investigated there (and that was also before
the whole granular advance approach was devised), but IIRC it would look
rather similar to what this is doing for zero range. That may or may
not justify just making the batch required for both operations and
potentially simplifying this logic further. I'll keep that in mind when
I get to it..

After that, I may play around with the buffered write path, but that is
a larger change with slightly different scope and requirements..

Brian

> --D
> 
> > +			return -EIO;
> > +
> > +		if (!iter.fbatch &&
> > +		    (srcmap->type == IOMAP_HOLE ||
> > +		     srcmap->type == IOMAP_UNWRITTEN)) {
> >  			s64 status;
> >  
> >  			if (range_dirty) {
> > diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
> > index 6ffc6a7b9ba5..89bd5951a6fd 100644
> > --- a/fs/iomap/iter.c
> > +++ b/fs/iomap/iter.c
> > @@ -9,6 +9,12 @@
> >  
> >  static inline void iomap_iter_reset_iomap(struct iomap_iter *iter)
> >  {
> > +	if (iter->fbatch) {
> > +		folio_batch_release(iter->fbatch);
> > +		kfree(iter->fbatch);
> > +		iter->fbatch = NULL;
> > +	}
> > +
> >  	iter->status = 0;
> >  	memset(&iter->iomap, 0, sizeof(iter->iomap));
> >  	memset(&iter->srcmap, 0, sizeof(iter->srcmap));
> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index 522644d62f30..0b9b460b2873 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -9,6 +9,7 @@
> >  #include <linux/types.h>
> >  #include <linux/mm_types.h>
> >  #include <linux/blkdev.h>
> > +#include <linux/pagevec.h>
> >  
> >  struct address_space;
> >  struct fiemap_extent_info;
> > @@ -239,6 +240,7 @@ struct iomap_iter {
> >  	unsigned flags;
> >  	struct iomap iomap;
> >  	struct iomap srcmap;
> > +	struct folio_batch *fbatch;
> >  	void *private;
> >  };
> >  
> > @@ -345,6 +347,8 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
> >  bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
> >  int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> >  		const struct iomap_ops *ops);
> > +loff_t iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t offset,
> > +		loff_t length);
> >  int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
> >  		bool *did_zero, const struct iomap_ops *ops, void *private);
> >  int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> > -- 
> > 2.49.0
> > 
> > 
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/7] iomap: optional zero range dirty folio processing
  2025-06-10  4:27     ` Christoph Hellwig
@ 2025-06-10 12:21       ` Brian Foster
  0 siblings, 0 replies; 40+ messages in thread
From: Brian Foster @ 2025-06-10 12:21 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-fsdevel, linux-xfs, linux-mm

On Mon, Jun 09, 2025 at 09:27:07PM -0700, Christoph Hellwig wrote:
> On Mon, Jun 09, 2025 at 09:04:20AM -0700, Darrick J. Wong wrote:
> > > +	if (iter->fbatch) {
> > > +		struct folio *folio = folio_batch_next(iter->fbatch);
> > > +
> > > +		if (folio) {
> > > +			folio_get(folio);
> > > +			folio_lock(folio);
> > 
> > Hrm.  So each folio that is added to the batch isn't locked, nor does
> > the batch (or iomap) hold a refcount on the folio until we get here.
> 
> find_get_entry references a folio, and filemap_get_folios_dirty
> preserves that reference.  It will be released in folio_batch_release.
> So this just add an extra reference for local operation so that the
> rest of iomap doesn't need to know about that magic.
> 

Exactly.

> > Do
> > we have to re-check that folio->{mapping,index} match what iomap is
> > trying to process?  Or can we assume that nobody has removed the folio
> > from the mapping?
> 
> That's a good point, though  as without having the folio locked it
> could get truncated, so I think we'll have to redo the truncate
> check here.
> 
> 

Hm Ok thanks, I'll take a closer look at that..

Brian


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 5/7] xfs: fill dirty folios on zero range of unwritten mappings
  2025-06-09 16:12   ` Darrick J. Wong
  2025-06-10  4:31     ` Christoph Hellwig
@ 2025-06-10 12:24     ` Brian Foster
  2025-07-02 18:50       ` Darrick J. Wong
  1 sibling, 1 reply; 40+ messages in thread
From: Brian Foster @ 2025-06-10 12:24 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Mon, Jun 09, 2025 at 09:12:19AM -0700, Darrick J. Wong wrote:
> On Thu, Jun 05, 2025 at 01:33:55PM -0400, Brian Foster wrote:
> > Use the iomap folio batch mechanism to select folios to zero on zero
> > range of unwritten mappings. Trim the resulting mapping if the batch
> > is filled (unlikely for current use cases) to distinguish between a
> > range to skip and one that requires another iteration due to a full
> > batch.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_iomap.c | 23 +++++++++++++++++++++++
> >  1 file changed, 23 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index b5cf5bc6308d..63054f7ead0e 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
...
> > @@ -1769,6 +1772,26 @@ xfs_buffered_write_iomap_begin(
> >  		if (offset_fsb < eof_fsb && end_fsb > eof_fsb)
> >  			end_fsb = eof_fsb;
> >  
> > +		/*
> > +		 * Look up dirty folios for unwritten mappings within EOF.
> > +		 * Providing this bypasses the flush iomap uses to trigger
> > +		 * extent conversion when unwritten mappings have dirty
> > +		 * pagecache in need of zeroing.
> > +		 *
> > +		 * Trim the mapping to the end pos of the lookup, which in turn
> > +		 * was trimmed to the end of the batch if it became full before
> > +		 * the end of the mapping.
> > +		 */
> > +		if (imap.br_state == XFS_EXT_UNWRITTEN &&
> > +		    offset_fsb < eof_fsb) {
> > +			loff_t len = min(count,
> > +					 XFS_FSB_TO_B(mp, imap.br_blockcount));
> > +
> > +			end = iomap_fill_dirty_folios(iter, offset, len);
> 
> ...though I wonder, does this need to happen in
> xfs_buffered_write_iomap_begin?  Is it required to hold the ILOCK while
> we go look for folios in the mapping?  Or could this become a part of
> iomap_write_begin?
> 

Technically it does not need to be inside ->iomap_begin(). The "dirty
check" just needs to be before the fs drops its own locks associated
with the mapping lookup to maintain functional correctness, and that
includes doing it before the callout in the first place (i.e. this is
how the filemap_range_needs_writeback() logic works). I have various
older prototype versions of that work that tried to do things a bit more
generically in that way, but ultimately they seemed less elegant for the
purpose of zero range.

WRT zero range, the main reason this is in the callback is that it's
only required to search for dirty folios when the underlying mapping is
unwritten, and we don't know that until the filesystem provides the
mapping (and doing it after the fs drops locks is racy).

That said, if we eventually use this for something like buffered writes,
that is not so much of an issue and we probably want to instead
lookup/allocate/lock each successive folio up front. That could likely
occur at the iomap level (lock ordering issues and whatnot
notwithstanding).

The one caveat with zero range is that it's only really used for small
ranges in practice, so it may not really be that big of a deal if the
folio lookup occurred unconditionally. I think the justification for
that is tied to broader use of batching in iomap, however, so I don't
really want to force the issue unless it proves worthwhile. IOW what I'm
trying to say is that if we do end up with a few more ops using this
mechanism, it wouldn't surprise me if we just decided to deduplicate to
the lowest common denominator implementation at that point (and do the
lookups in iomap iter or something). We're just not there yet IMO.

Brian

> --D
> 
> > +			end_fsb = min_t(xfs_fileoff_t, end_fsb,
> > +					XFS_B_TO_FSB(mp, end));
> > +		}
> > +
> >  		xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
> >  	}
> >  
> > -- 
> > 2.49.0
> > 
> > 
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC 7/7] xfs: error tag to force zeroing on debug kernels
  2025-06-10  4:33   ` Christoph Hellwig
@ 2025-06-10 12:26     ` Brian Foster
  2025-06-10 13:30       ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Brian Foster @ 2025-06-10 12:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Mon, Jun 09, 2025 at 09:33:37PM -0700, Christoph Hellwig wrote:
> On Thu, Jun 05, 2025 at 01:33:57PM -0400, Brian Foster wrote:
> > iomap_zero_range() has to cover various corner cases that are
> > difficult to test on production kernels because it is used in fairly
> > limited use cases. For example, it is currently only used by XFS and
> > mostly only in partial block zeroing cases.
> > 
> > While it's possible to test most of these functional cases, we can
> > provide more robust test coverage by co-opting fallocate zero range
> > to invoke zeroing of the entire range instead of the more efficient
> > block punch/allocate sequence. Add an errortag to occasionally
> > invoke forced zeroing.
> 
> I like this, having an easy way to improve code coverage using the
> existing fallocate and errtag interfaces is always a good thing.
> 
> Can I assume you plan to add a testcase using the errtag to xfstests?
> 

Well that is kind of the question.. ;) My preference was to either add
something to fstests to enable select errortags by default on every
mount, or do the same in-kernel via XFS_DEBUG[_ERRTAGS] or some such,
rather than just creating a one-off test that runs fsx or whatever with
this error tag turned on [1].

That said, I wouldn't be opposed to just doing both if folks prefer
that. It just bugs me to add yet another test that only runs a specific
fsx test when we get much more coverage by running the full suite of
tests. IOW, whenever somebody is testing a kernel that would actually
run a custom test (XFS_DEBUG plus specific errortag support), we could
in theory be running the whole suite with the same errortag turned on
(albeit perhaps at a lesser frequency than a custom test would use). So
from that perspective I'm not sure it makes a whole lot of sense to do
both.

So any thoughts from anyone on a custom test vs. enabling errortag
defaults (via fstests or kernel) vs. some combination of both?

Brian

[1] Eric also raised the idea of branching off "test tag" variants of
errortags that might help distinguish injection points that control
behavior vs. those that truly create errors. That could reduce confusion
for testers and whatnot.

I haven't dug into viability, but in theory that could also define a set
of events that don't spew event trigger noise into dmesg if certain
events were to be enabled by default (again, on debug kernels only).


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/7] iomap: optional zero range dirty folio processing
  2025-06-10 12:21     ` Brian Foster
@ 2025-06-10 13:29       ` Christoph Hellwig
  2025-06-10 14:19         ` Brian Foster
  2025-06-10 14:55       ` Darrick J. Wong
  1 sibling, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-10 13:29 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-fsdevel, linux-xfs, linux-mm

On Tue, Jun 10, 2025 at 08:21:06AM -0400, Brian Foster wrote:
> Yes.. but I'm not totally sure wrt impact on the fbatch checks quite
> yet. The next thing I wanted to look at is addressing the same unwritten
> mapping vs. dirty folios issue in the seek data/hole path. It's been a
> little while since I last investigated there (and that was also before
> the whole granular advance approach was devised), but IIRC it would look
> rather similar to what this is doing for zero range. That may or may
> not justify just making the batch required for both operations and
> potentially simplifying this logic further. I'll keep that in mind when
> I get to it..

One thing that the batch would be extremely useful for is making
iomap_file_unshare not totally suck by reading in all folios for a
range (not just the dirty ones) similar to the filemap_read path
instead of synchronously reading one block at a time.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC 7/7] xfs: error tag to force zeroing on debug kernels
  2025-06-10 12:26     ` Brian Foster
@ 2025-06-10 13:30       ` Christoph Hellwig
  2025-06-10 14:20         ` Brian Foster
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-10 13:30 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-mm

On Tue, Jun 10, 2025 at 08:26:45AM -0400, Brian Foster wrote:
> Well that is kind of the question.. ;) My preference was to either add
> something to fstests to enable select errortags by default on every
> mount (or do the same in-kernel via XFS_DEBUG[_ERRTAGS] or some such)
> over just creating a one-off test that runs fsx or whatever with this
> error tag turned on. [1].
> 
> That said, I wouldn't be opposed to just doing both if folks prefer
> that. It just bugs me to add yet another test that only runs a specific
> fsx test when we get much more coverage by running the full suite of
> tests. IOW, whenever somebody is testing a kernel that would actually
> run a custom test (XFS_DEBUG plus specific errortag support), we could
> in theory be running the whole suite with the same errortag turned on
> (albeit perhaps at a lesser frequency than a custom test would use). So
> from that perspective I'm not sure it makes a whole lot of sense to do
> both.
> 
> So any thoughts from anyone on a custom test vs. enabling errortag
> defaults (via fstests or kernel) vs. some combination of both?

I definitely like a targeted test to exercise it.  If you want
additional knobs to turn on error tags, that's probably fine if it
works out.  I'm worried about adding more flags to xfstests because
it makes it really hard to figure out what runs are needed for good
test coverage.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/7] iomap: optional zero range dirty folio processing
  2025-06-10 13:29       ` Christoph Hellwig
@ 2025-06-10 14:19         ` Brian Foster
  2025-06-11  3:54           ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Brian Foster @ 2025-06-10 14:19 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-fsdevel, linux-xfs, linux-mm

On Tue, Jun 10, 2025 at 06:29:10AM -0700, Christoph Hellwig wrote:
> On Tue, Jun 10, 2025 at 08:21:06AM -0400, Brian Foster wrote:
> > Yes.. but I'm not totally sure wrt impact on the fbatch checks quite
> > yet. The next thing I wanted to look at is addressing the same unwritten
> > mapping vs. dirty folios issue in the seek data/hole path. It's been a
> > little while since I last investigated there (and that was also before
> > the whole granular advance approach was devised), but IIRC it would look
> > rather similar to what this is doing for zero range. That may or may
> > not justify just making the batch required for both operations and
> > potentially simplifying this logic further. I'll keep that in mind when
> > I get to it..
> 
> On thing that the batch would be extremely useful for is making
> iomap_file_unshare not totally suck by reading in all folios for a
> range (not just the dirty ones) similar to the filemap_read path
> instead of synchronously reading one block at a time.
> 

I can add it to the list to look into. On a quick look though, is there
any reason we wouldn't want to just invoke readahead or something
somewhere in that loop, particularly if that is mainly a performance
issue..?

Brian


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC 7/7] xfs: error tag to force zeroing on debug kernels
  2025-06-10 13:30       ` Christoph Hellwig
@ 2025-06-10 14:20         ` Brian Foster
  2025-06-10 19:12           ` Brian Foster
  0 siblings, 1 reply; 40+ messages in thread
From: Brian Foster @ 2025-06-10 14:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Tue, Jun 10, 2025 at 06:30:29AM -0700, Christoph Hellwig wrote:
> On Tue, Jun 10, 2025 at 08:26:45AM -0400, Brian Foster wrote:
> > Well that is kind of the question.. ;) My preference was to either add
> > something to fstests to enable select errortags by default on every
> > mount (or do the same in-kernel via XFS_DEBUG[_ERRTAGS] or some such)
> > over just creating a one-off test that runs fsx or whatever with this
> > error tag turned on. [1].
> > 
> > That said, I wouldn't be opposed to just doing both if folks prefer
> > that. It just bugs me to add yet another test that only runs a specific
> > fsx test when we get much more coverage by running the full suite of
> > tests. IOW, whenever somebody is testing a kernel that would actually
> > run a custom test (XFS_DEBUG plus specific errortag support), we could
> > in theory be running the whole suite with the same errortag turned on
> > (albeit perhaps at a lesser frequency than a custom test would use). So
> > from that perspective I'm not sure it makes a whole lot of sense to do
> > both.
> > 
> > So any thoughts from anyone on a custom test vs. enabling errortag
> > defaults (via fstests or kernel) vs. some combination of both?
> 
> I definitely like a targeted test to exercise it.  If you want
> additional knobs to turn on error tags, that's probably fine if it
> works out.  I'm worried about adding more flags to xfstests because
> it makes it really hard to figure out what runs are needed for good
> test coverage.
> 
> 

Yeah, an fstests variable would add yet another configuration to test,
which maybe defeats the point. But we could still turn on certain tags
by default in the kernel. For example, see the couple of open-coded
get_random_u32_below() callsites in XFS where we already effectively do
this for XFS_DEBUG; they just aren't implemented as proper errortags.
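
For reference, the pattern in those callsites is roughly the below.
Illustrative only (the use_alternate_path flag is made up), not a copy
of either actual callsite:

#ifdef DEBUG
	/* debug kernels randomly force the alternate code path */
	if (get_random_u32_below(2))
		use_alternate_path = true;
#endif

i.e., it already fires on XFS_DEBUG kernels with no errortag plumbing
and no warning.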

I think the main thing that would need to change is to not xfs_warn() on
those knobs when they are enabled by default. I think there are a few
different ways that could possibly be done, ideally so we go back to
default/warn behavior when userspace makes an explicit errortag change,
but I'd have to play around with it a little bit. Hm?

Anyways, given the fstests config matrix concern I'm inclined to at
least give something like that a try first and then fall back to a
custom test if that fails or is objectionable for some other reason..

Brian


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/7] iomap: optional zero range dirty folio processing
  2025-06-10 12:21     ` Brian Foster
  2025-06-10 13:29       ` Christoph Hellwig
@ 2025-06-10 14:55       ` Darrick J. Wong
  2025-06-11  3:55         ` Christoph Hellwig
  1 sibling, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2025-06-10 14:55 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Tue, Jun 10, 2025 at 08:21:06AM -0400, Brian Foster wrote:
> On Mon, Jun 09, 2025 at 09:04:20AM -0700, Darrick J. Wong wrote:
> > On Thu, Jun 05, 2025 at 01:33:53PM -0400, Brian Foster wrote:
> > > The only way zero range can currently process unwritten mappings
> > > with dirty pagecache is to check whether the range is dirty before
> > > mapping lookup and then flush when at least one underlying mapping
> > > is unwritten. This ordering is required to prevent iomap lookup from
> > > racing with folio writeback and reclaim.
> > > 
> > > Since zero range can skip ranges of unwritten mappings that are
> > > clean in cache, this operation can be improved by allowing the
> > > filesystem to provide a set of dirty folios that require zeroing. In
> > > turn, rather than flush or iterate file offsets, zero range can
> > > iterate on folios in the batch and advance over clean or uncached
> > > ranges in between.
> > > 
> > > Add a folio_batch in struct iomap and provide a helper for fs' to
> > > populate the batch at lookup time. Update the folio lookup path to
> > > return the next folio in the batch, if provided, and advance the
> > > iter if the folio starts beyond the current offset.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/iomap/buffered-io.c | 73 +++++++++++++++++++++++++++++++++++++++---
> > >  fs/iomap/iter.c        |  6 ++++
> > >  include/linux/iomap.h  |  4 +++
> > >  3 files changed, 78 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index 16499655e7b0..cf2f4f869920 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -750,6 +750,16 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter, size_t len)
> > >  	if (!mapping_large_folio_support(iter->inode->i_mapping))
> > >  		len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
> > >  
> > > +	if (iter->fbatch) {
> > > +		struct folio *folio = folio_batch_next(iter->fbatch);
> > > +
> > > +		if (folio) {
> > > +			folio_get(folio);
> > > +			folio_lock(folio);
> > 
> > Hrm.  So each folio that is added to the batch isn't locked, nor does
> > the batch (or iomap) hold a refcount on the folio until we get here.  Do
> > we have to re-check that folio->{mapping,index} match what iomap is
> > trying to process?  Or can we assume that nobody has removed the folio
> > from the mapping?
> > 
> 
> The filemap helper grabs a reference to the folio but doesn't
> necessarily lock it. The ref is effectively transferred to the batch
> there and the _get() here creates the iomap reference (i.e. that is
> analogous to the traditional iomap get folio path). The batch is
> ultimately released via folio_batch_release() and the iomap refs dropped
> in the same way regardless of whether iomap grabbed the folio itself or
> it came from a batch.

Oh, ok, so that's really iomap getting its own ref on the folio to
remain independent of whatever the fbatch code does (or might some day
do).

> > I'm wondering because __filemap_get_folio/filemap_get_entry seem to do
> > all that for us.  I think the folio_pos check below might cover some of
> > that revalidation?
> > 
> 
> I'm not totally sure the folio revalidation is necessarily required
> here.. If it is, I'd also need to think about whether it's ok to skip
> such folios or the approach here needs revisiting. I'll take a closer
> look and also try to document this better and get some feedback from
> people who know this code better in the next go around..

Hrmm.  On closer examination, at least for xfs we've taken i_rwsem and
the invalidate_lock so I think it should be the case that you don't need
to revalidate.  I think the same locks are held for iomap_unshare_range
(mentioned elsewhere in this thread) though it doesn't apply to regular
pagecache writes.

> > > +		}
> > > +		return folio;
> > > +	}
> > > +
> > >  	if (folio_ops && folio_ops->get_folio)
> > >  		return folio_ops->get_folio(iter, pos, len);
> > >  	else
> ...
> > > @@ -819,6 +831,12 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
> > >  	if (IS_ERR(folio))
> > >  		return PTR_ERR(folio);
> > >  
> > > +	/* no folio means we're done with a batch */
> > 
> > ...ran out of folios but *plen is nonzero, i.e. we still have range to
> > cover?
> > 
> 
> Yes I suppose that is implied by being in this path.. will fix.
> 
> > > +	if (!folio) {
> > > +		WARN_ON_ONCE(!iter->fbatch);
> > > +		return 0;
> > > +	}
> > > +
> > >  	/*
> > >  	 * Now we have a locked folio, before we do anything with it we need to
> > >  	 * check that the iomap we have cached is not stale. The inode extent
> ...
> > > +
> > >  int
> > >  iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > >  		const struct iomap_ops *ops, void *private)
> ...
> > > @@ -1445,13 +1503,18 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > >  	 * if dirty and the fs returns a mapping that might convert on
> > >  	 * writeback.
> > >  	 */
> > > -	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
> > > -					iter.pos, iter.pos + iter.len - 1);
> > > +	range_dirty = filemap_range_needs_writeback(mapping, iter.pos,
> > > +					iter.pos + iter.len - 1);
> > >  	while ((ret = iomap_iter(&iter, ops)) > 0) {
> > >  		const struct iomap *srcmap = iomap_iter_srcmap(&iter);
> > >  
> > > -		if (srcmap->type == IOMAP_HOLE ||
> > > -		    srcmap->type == IOMAP_UNWRITTEN) {
> > > +		if (WARN_ON_ONCE(iter.fbatch &&
> > > +				 srcmap->type != IOMAP_UNWRITTEN))
> > 
> > I wonder, are you planning to expand the folio batching to other
> > buffered-io.c operations?  Such that the iter.fbatch checks might some
> > day go away?
> > 
> 
> Yes.. but I'm not totally sure wrt impact on the fbatch checks quite
> yet. The next thing I wanted to look at is addressing the same unwritten
> mapping vs. dirty folios issue in the seek data/hole path. It's been a
> little while since I last investigated there (and that was also before
> the whole granular advance approach was devised), but IIRC it would look
> rather similar to what this is doing for zero range. That may or may
> not justify just making the batch required for both operations and
> potentially simplifying this logic further. I'll keep that in mind when
> I get to it..
> 
> After that, I may play around with the buffered write path, but that is
> a larger change with slightly different scope and requirements..

<nod>

--D

> Brian
> 
> > --D
> > 
> > > +			return -EIO;
> > > +
> > > +		if (!iter.fbatch &&
> > > +		    (srcmap->type == IOMAP_HOLE ||
> > > +		     srcmap->type == IOMAP_UNWRITTEN)) {
> > >  			s64 status;
> > >  
> > >  			if (range_dirty) {
> > > diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
> > > index 6ffc6a7b9ba5..89bd5951a6fd 100644
> > > --- a/fs/iomap/iter.c
> > > +++ b/fs/iomap/iter.c
> > > @@ -9,6 +9,12 @@
> > >  
> > >  static inline void iomap_iter_reset_iomap(struct iomap_iter *iter)
> > >  {
> > > +	if (iter->fbatch) {
> > > +		folio_batch_release(iter->fbatch);
> > > +		kfree(iter->fbatch);
> > > +		iter->fbatch = NULL;
> > > +	}
> > > +
> > >  	iter->status = 0;
> > >  	memset(&iter->iomap, 0, sizeof(iter->iomap));
> > >  	memset(&iter->srcmap, 0, sizeof(iter->srcmap));
> > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > > index 522644d62f30..0b9b460b2873 100644
> > > --- a/include/linux/iomap.h
> > > +++ b/include/linux/iomap.h
> > > @@ -9,6 +9,7 @@
> > >  #include <linux/types.h>
> > >  #include <linux/mm_types.h>
> > >  #include <linux/blkdev.h>
> > > +#include <linux/pagevec.h>
> > >  
> > >  struct address_space;
> > >  struct fiemap_extent_info;
> > > @@ -239,6 +240,7 @@ struct iomap_iter {
> > >  	unsigned flags;
> > >  	struct iomap iomap;
> > >  	struct iomap srcmap;
> > > +	struct folio_batch *fbatch;
> > >  	void *private;
> > >  };
> > >  
> > > @@ -345,6 +347,8 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
> > >  bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
> > >  int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> > >  		const struct iomap_ops *ops);
> > > +loff_t iomap_fill_dirty_folios(struct iomap_iter *iter, loff_t offset,
> > > +		loff_t length);
> > >  int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
> > >  		bool *did_zero, const struct iomap_ops *ops, void *private);
> > >  int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> > > -- 
> > > 2.49.0
> > > 
> > > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC 7/7] xfs: error tag to force zeroing on debug kernels
  2025-06-10 14:20         ` Brian Foster
@ 2025-06-10 19:12           ` Brian Foster
  2025-06-11  3:56             ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Brian Foster @ 2025-06-10 19:12 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Tue, Jun 10, 2025 at 10:20:28AM -0400, Brian Foster wrote:
> On Tue, Jun 10, 2025 at 06:30:29AM -0700, Christoph Hellwig wrote:
> > On Tue, Jun 10, 2025 at 08:26:45AM -0400, Brian Foster wrote:
> > > Well that is kind of the question.. ;) My preference was to either add
> > > something to fstests to enable select errortags by default on every
> > > mount (or do the same in-kernel via XFS_DEBUG[_ERRTAGS] or some such)
> > > over just creating a one-off test that runs fsx or whatever with this
> > > error tag turned on. [1].
> > > 
> > > That said, I wouldn't be opposed to just doing both if folks prefer
> > > that. It just bugs me to add yet another test that only runs a specific
> > > fsx test when we get much more coverage by running the full suite of
> > > tests. IOW, whenever somebody is testing a kernel that would actually
> > > run a custom test (XFS_DEBUG plus specific errortag support), we could
> > > in theory be running the whole suite with the same errortag turned on
> > > (albeit perhaps at a lesser frequency than a custom test would use). So
> > > from that perspective I'm not sure it makes a whole lot of sense to do
> > > both.
> > > 
> > > So any thoughts from anyone on a custom test vs. enabling errortag
> > > defaults (via fstests or kernel) vs. some combination of both?
> > 
> > I definitely like a targeted test to exercise it.  If you want
> > additional knobs to turn on error tags, that's probably fine if it
> > works out.  I'm worried about adding more flags to xfstests because
> > it makes it really hard to figure out what runs are needed for good
> > test coverage.
> > 
> > 
> 
> Yeah, an fstests variable would add yet another configuration to test,
> which maybe defeats the point. But we could still turn on certain tags
> by default in the kernel. For example, see the couple of open-coded
> get_random_u32_below() callsites in XFS where we already effectively do
> this for XFS_DEBUG; they just aren't implemented as proper errortags.
> 
> I think the main thing that would need to change is to not xfs_warn() on
> those knobs when they are enabled by default. I think there are a few
> different ways that could possibly be done, ideally so we go back to
> default/warn behavior when userspace makes an explicit errortag change,
> but I'd have to play around with it a little bit. Hm?
> 
> Anyways, given the fstests config matrix concern I'm inclined to at
> least give something like that a try first and then fall back to a
> custom test if that fails or is objectionable for some other reason..
> 
> Brian
> 
> 

Here's a prototype for 1. an errtag quiet mode and 2. on-by-default
tags. The alternative to a per-mount flag would be to hack a new struct
into m_errortag that holds the current randfactor as well as a per-tag
quiet flag, though I'm not sure how much people care about that. I
didn't really plan on exposing this to userspace or anything for per-tag
support, but this does mean all tags would start to warn once userspace
changes any tag. I suppose that could become noisy if some day we end up
with a bunch more default enabled tags. *shrug* I could go either way.

Otherwise I think this would allow conversion of the two open coded
get_random_u32_below() cases and the new force zero tag into
on-by-default errortags. Any thoughts?

--- 8< ---

 diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index dbd87e137694..54b38143a7a6 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -69,6 +69,7 @@ static unsigned int xfs_errortag_random_default[] = {
 struct xfs_errortag_attr {
 	struct attribute	attr;
 	unsigned int		tag;
+	bool			enable_default;
 };
 
 static inline struct xfs_errortag_attr *
@@ -129,12 +130,15 @@ static const struct sysfs_ops xfs_errortag_sysfs_ops = {
 	.store = xfs_errortag_attr_store,
 };
 
-#define XFS_ERRORTAG_ATTR_RW(_name, _tag) \
+#define __XFS_ERRORTAG_ATTR_RW(_name, _tag, enable) \
 static struct xfs_errortag_attr xfs_errortag_attr_##_name = {		\
 	.attr = {.name = __stringify(_name),				\
 		 .mode = VERIFY_OCTAL_PERMISSIONS(S_IWUSR | S_IRUGO) },	\
 	.tag	= (_tag),						\
+	.enable_default = enable,					\
 }
+#define XFS_ERRORTAG_ATTR_RW(_name, _tag) \
+	__XFS_ERRORTAG_ATTR_RW(_name, _tag, false)
 
 #define XFS_ERRORTAG_ATTR_LIST(_name) &xfs_errortag_attr_##_name.attr
 
@@ -240,6 +244,25 @@ static const struct kobj_type xfs_errortag_ktype = {
 	.default_groups = xfs_errortag_groups,
 };
 
+static void
+xfs_errortag_init_enable_defaults(
+	struct xfs_mount	*mp)
+{
+	int i;
+
+	for (i = 0; xfs_errortag_attrs[i]; i++) {
+		struct xfs_errortag_attr *xfs_attr =
+				to_attr(xfs_errortag_attrs[i]);
+
+		if (!xfs_attr->enable_default)
+			continue;
+
+		xfs_set_quiet_errtag(mp);
+		mp->m_errortag[xfs_attr->tag] =
+			xfs_errortag_random_default[xfs_attr->tag];
+	}
+}
+
 int
 xfs_errortag_init(
 	struct xfs_mount	*mp)
@@ -251,6 +274,8 @@ xfs_errortag_init(
 	if (!mp->m_errortag)
 		return -ENOMEM;
 
+	xfs_errortag_init_enable_defaults(mp);
+
 	ret = xfs_sysfs_init(&mp->m_errortag_kobj, &xfs_errortag_ktype,
 				&mp->m_kobj, "errortag");
 	if (ret)
@@ -320,9 +345,11 @@ xfs_errortag_test(
 	if (!randfactor || get_random_u32_below(randfactor))
 		return false;
 
-	xfs_warn_ratelimited(mp,
+	if (!xfs_is_quiet_errtag(mp)) {
+		xfs_warn_ratelimited(mp,
 "Injecting error (%s) at file %s, line %d, on filesystem \"%s\"",
 			expression, file, line, mp->m_super->s_id);
+	}
 	return true;
 }
 
@@ -346,6 +373,7 @@ xfs_errortag_set(
 	if (!xfs_errortag_valid(error_tag))
 		return -EINVAL;
 
+	xfs_clear_quiet_errtag(mp);
 	mp->m_errortag[error_tag] = tag_value;
 	return 0;
 }
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index d85084f9f317..44b02728056f 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -558,6 +558,8 @@ __XFS_HAS_FEAT(nouuid, NOUUID)
  */
 #define XFS_OPSTATE_BLOCKGC_ENABLED	6
 
+/* Debug kernel skips warning on errtag event triggers */
+#define XFS_OPSTATE_QUIET_ERRTAG	7
 /* Kernel has logged a warning about shrink being used on this fs. */
 #define XFS_OPSTATE_WARNED_SHRINK	9
 /* Kernel has logged a warning about logged xattr updates being used. */
@@ -600,6 +602,7 @@ __XFS_IS_OPSTATE(inode32, INODE32)
 __XFS_IS_OPSTATE(readonly, READONLY)
 __XFS_IS_OPSTATE(inodegc_enabled, INODEGC_ENABLED)
 __XFS_IS_OPSTATE(blockgc_enabled, BLOCKGC_ENABLED)
+__XFS_IS_OPSTATE(quiet_errtag, QUIET_ERRTAG)
 #ifdef CONFIG_XFS_QUOTA
 __XFS_IS_OPSTATE(quotacheck_running, QUOTACHECK_RUNNING)
 __XFS_IS_OPSTATE(resuming_quotaon, RESUMING_QUOTAON)


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/7] iomap: optional zero range dirty folio processing
  2025-06-10 14:19         ` Brian Foster
@ 2025-06-11  3:54           ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-11  3:54 UTC (permalink / raw)
  To: Brian Foster
  Cc: Christoph Hellwig, Darrick J. Wong, linux-fsdevel, linux-xfs,
	linux-mm

On Tue, Jun 10, 2025 at 10:19:35AM -0400, Brian Foster wrote:
> > One thing that the batch would be extremely useful for is making
> > iomap_file_unshare not totally suck by reading in all folios for a
> > range (not just the dirty ones) similar to the filemap_read path
> > instead of synchronously reading one block at a time.
> > 
> 
> I can add it to the list to look into. On a quick look though, is there
> any reason we wouldn't want to just invoke readahead or something
> somewhere in that loop, particularly if that is mainly a performance
> issue?

I was planning to look into it once your series lands.

Yes, doing readahead is the main thing.  But once we start doing
that we might as well try to reuse the entire folio_batch optimization
done in the file read path, which has been shown to be pretty effective.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/7] iomap: optional zero range dirty folio processing
  2025-06-10 14:55       ` Darrick J. Wong
@ 2025-06-11  3:55         ` Christoph Hellwig
  2025-06-12  4:06           ` Darrick J. Wong
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-11  3:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-fsdevel, linux-xfs, linux-mm

On Tue, Jun 10, 2025 at 07:55:52AM -0700, Darrick J. Wong wrote:
> Hrmm.  On closer examination, at least for xfs we've taken i_rwsem and
> the invalidate_lock so I think it should be the case that you don't need
> to revalidate.  I think the same locks are held for iomap_unshare_range
> (mentioned elsewhere in this thread) though it doesn't apply to regular
> pagecache writes.

We should document these assumptions, preferably using (lockdep)
asserts.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC 7/7] xfs: error tag to force zeroing on debug kernels
  2025-06-10 19:12           ` Brian Foster
@ 2025-06-11  3:56             ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-06-11  3:56 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-mm

On Tue, Jun 10, 2025 at 03:12:31PM -0400, Brian Foster wrote:
> Here's a prototype for 1. an errtag quiet mode and 2. on-by-default
> tags.

That looks fine to me, but it's not really my area of expertise.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/7] iomap: optional zero range dirty folio processing
  2025-06-11  3:55         ` Christoph Hellwig
@ 2025-06-12  4:06           ` Darrick J. Wong
  0 siblings, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2025-06-12  4:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Brian Foster, linux-fsdevel, linux-xfs, linux-mm

On Tue, Jun 10, 2025 at 08:55:10PM -0700, Christoph Hellwig wrote:
> On Tue, Jun 10, 2025 at 07:55:52AM -0700, Darrick J. Wong wrote:
> > Hrmm.  On closer examination, at least for xfs we've taken i_rwsem and
> > the invalidate_lock so I think it should be the case that you don't need
> > to revalidate.  I think the same locks are held for iomap_unshare_range
> > (mentioned elsewhere in this thread) though it doesn't apply to regular
> > pagecache writes.
> 
> We should document these assumptions, preferably using (lockdep)
> asserts.

Agreed.  I think most of iomap/buffered-io.c wants the caller to hold
i_rwsem in shared mode for reads; i_rwsem in exclusive mode for writes;
and the invalidate lock for page faults.

The big exception iirc is iomap_zero_range where you need to hold
i_rwsem and the mapping invalidate lock, right?
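
If we go the lockdep route, I'm picturing something like the below at
the top of iomap_zero_range().  Sketch only; whether every caller really
holds both locks, and in which mode, would need to be verified first:

	lockdep_assert_held(&inode->i_rwsem);
	lockdep_assert_held(&inode->i_mapping->invalidate_lock);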

--D

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 5/7] xfs: fill dirty folios on zero range of unwritten mappings
  2025-06-10 12:24     ` Brian Foster
@ 2025-07-02 18:50       ` Darrick J. Wong
  0 siblings, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2025-07-02 18:50 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Tue, Jun 10, 2025 at 08:24:00AM -0400, Brian Foster wrote:
> On Mon, Jun 09, 2025 at 09:12:19AM -0700, Darrick J. Wong wrote:
> > On Thu, Jun 05, 2025 at 01:33:55PM -0400, Brian Foster wrote:
> > > Use the iomap folio batch mechanism to select folios to zero on zero
> > > range of unwritten mappings. Trim the resulting mapping if the batch
> > > is filled (unlikely for current use cases) to distinguish between a
> > > range to skip and one that requires another iteration due to a full
> > > batch.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/xfs_iomap.c | 23 +++++++++++++++++++++++
> > >  1 file changed, 23 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index b5cf5bc6308d..63054f7ead0e 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> ...
> > > @@ -1769,6 +1772,26 @@ xfs_buffered_write_iomap_begin(
> > >  		if (offset_fsb < eof_fsb && end_fsb > eof_fsb)
> > >  			end_fsb = eof_fsb;
> > >  
> > > +		/*
> > > +		 * Look up dirty folios for unwritten mappings within EOF.
> > > +		 * Providing this bypasses the flush iomap uses to trigger
> > > +		 * extent conversion when unwritten mappings have dirty
> > > +		 * pagecache in need of zeroing.
> > > +		 *
> > > +		 * Trim the mapping to the end pos of the lookup, which in turn
> > > +		 * was trimmed to the end of the batch if it became full before
> > > +		 * the end of the mapping.
> > > +		 */
> > > +		if (imap.br_state == XFS_EXT_UNWRITTEN &&
> > > +		    offset_fsb < eof_fsb) {
> > > +			loff_t len = min(count,
> > > +					 XFS_FSB_TO_B(mp, imap.br_blockcount));
> > > +
> > > +			end = iomap_fill_dirty_folios(iter, offset, len);
> > 
> > ...though I wonder, does this need to happen in
> > xfs_buffered_write_iomap_begin?  Is it required to hold the ILOCK while
> > we go look for folios in the mapping?  Or could this become a part of
> > iomap_write_begin?
> > 
> 
> Technically it does not need to be inside ->iomap_begin(). The "dirty
> check" just needs to be before the fs drops its own locks associated
> with the mapping lookup to maintain functional correctness, and that
> includes doing it before the callout in the first place (i.e. this is
> how the filemap_range_needs_writeback() logic works). I have various
> older prototype versions of that work that tried to do things a bit more
> generically in that way, but ultimately they seemed less elegant for the
> purpose of zero range.
> 
> WRT zero range, the main reason this is in the callback is that it's
> only required to search for dirty folios when the underlying mapping is
> unwritten, and we don't know that until the filesystem provides the
> mapping (and doing it after the fs drops locks is racy).

<nod>

> That said, if we eventually use this for something like buffered writes,
> that is not so much of an issue and we probably want to instead
> lookup/allocate/lock each successive folio up front. That could likely
> occur at the iomap level (lock ordering issues and whatnot
> notwithstanding).
> 
> The one caveat with zero range is that it's only really used for small
> ranges in practice, so it may not really be that big of a deal if the
> folio lookup occurred unconditionally. I think the justification for
> that is tied to broader use of batching in iomap, however, so I don't
> really want to force the issue unless it proves worthwhile. IOW what I'm
> trying to say is that if we do end up with a few more ops using this
> mechanism, it wouldn't surprise me if we just decided to deduplicate to
> the lowest common denominator implementation at that point (and do the
> lookups in iomap iter or something). We're just not there yet IMO.

<nod> I suppose it could be useful for performance reasons to try to
grab as many folios as we can while we still hold the ILOCK, though we'd
have to be careful about lock inversions.

--D

> 
> Brian
> 
> > --D
> > 
> > > +			end_fsb = min_t(xfs_fileoff_t, end_fsb,
> > > +					XFS_B_TO_FSB(mp, end));
> > > +		}
> > > +
> > >  		xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
> > >  	}
> > >  
> > > -- 
> > > 2.49.0
> > > 
> > > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2025-07-02 18:50 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-05 17:33 [PATCH 0/7] iomap: zero range folio batch support Brian Foster
2025-06-05 17:33 ` [PATCH 1/7] iomap: move pos+len BUG_ON() to after folio lookup Brian Foster
2025-06-09 16:16   ` Darrick J. Wong
2025-06-10  4:20     ` Christoph Hellwig
2025-06-10 12:16       ` Brian Foster
2025-06-05 17:33 ` [PATCH 2/7] filemap: add helper to look up dirty folios in a range Brian Foster
2025-06-09 15:48   ` Darrick J. Wong
2025-06-10  4:21     ` Christoph Hellwig
2025-06-10 12:17     ` Brian Foster
2025-06-10  4:22   ` Christoph Hellwig
2025-06-05 17:33 ` [PATCH 3/7] iomap: optional zero range dirty folio processing Brian Foster
2025-06-09 16:04   ` Darrick J. Wong
2025-06-10  4:27     ` Christoph Hellwig
2025-06-10 12:21       ` Brian Foster
2025-06-10 12:21     ` Brian Foster
2025-06-10 13:29       ` Christoph Hellwig
2025-06-10 14:19         ` Brian Foster
2025-06-11  3:54           ` Christoph Hellwig
2025-06-10 14:55       ` Darrick J. Wong
2025-06-11  3:55         ` Christoph Hellwig
2025-06-12  4:06           ` Darrick J. Wong
2025-06-10  4:27   ` Christoph Hellwig
2025-06-05 17:33 ` [PATCH 4/7] xfs: always trim mapping to requested range for zero range Brian Foster
2025-06-09 16:07   ` Darrick J. Wong
2025-06-05 17:33 ` [PATCH 5/7] xfs: fill dirty folios on zero range of unwritten mappings Brian Foster
2025-06-06  2:02   ` kernel test robot
2025-06-06 15:20     ` Brian Foster
2025-06-09 16:12   ` Darrick J. Wong
2025-06-10  4:31     ` Christoph Hellwig
2025-06-10 12:24     ` Brian Foster
2025-07-02 18:50       ` Darrick J. Wong
2025-06-05 17:33 ` [PATCH 6/7] iomap: remove old partial eof zeroing optimization Brian Foster
2025-06-10  4:32   ` Christoph Hellwig
2025-06-05 17:33 ` [PATCH RFC 7/7] xfs: error tag to force zeroing on debug kernels Brian Foster
2025-06-10  4:33   ` Christoph Hellwig
2025-06-10 12:26     ` Brian Foster
2025-06-10 13:30       ` Christoph Hellwig
2025-06-10 14:20         ` Brian Foster
2025-06-10 19:12           ` Brian Foster
2025-06-11  3:56             ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).