* [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead
@ 2025-09-26 0:25 Joanne Koong
2025-09-26 0:25 ` [PATCH v5 01/14] iomap: move bio read logic into helper function Joanne Koong
` (14 more replies)
0 siblings, 15 replies; 50+ messages in thread
From: Joanne Koong @ 2025-09-26 0:25 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
This series adds fuse iomap support for buffered reads and readahead.
This is needed so that fuse can use granular uptodate tracking when
large folios are enabled, so that only the non-uptodate portions of a
folio need to be read in instead of the entire folio. It is also needed
in order to turn on large folios for servers that use the writeback
cache: otherwise there is a race condition that may lead to data
corruption when a partial write is followed by a read that happens
before the write has undergone writeback. Because the partial write does
not mark the folio uptodate, the read will read in the entire folio from
disk, which overwrites the partial write.
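As a rough illustration, here is a toy userspace model of that race
(this is not kernel code; the struct, block count, and helper names are
all invented for the sketch). With whole-folio uptodate tracking, a read
that fills the entire folio from disk clobbers a partial write that has
not yet been written back; with per-block tracking, only the stale
blocks are filled:

```c
#include <assert.h>
#include <string.h>

/* Toy model: a 4-block folio with per-block uptodate tracking. */
#define BLOCKS 4

struct folio_model {
	char data[BLOCKS];	/* in-memory page cache contents */
	int uptodate[BLOCKS];	/* per-block uptodate bits */
};

static const char disk[BLOCKS] = { 'd', 'd', 'd', 'd' };

/* Partial write to block 1, not yet written back to disk. */
static void partial_write(struct folio_model *f)
{
	f->data[1] = 'W';
	f->uptodate[1] = 1;
}

/* Granular read: only fill blocks that are not yet uptodate. */
static void read_granular(struct folio_model *f)
{
	for (int i = 0; i < BLOCKS; i++) {
		if (!f->uptodate[i]) {
			f->data[i] = disk[i];
			f->uptodate[i] = 1;
		}
	}
}

/* Whole-folio read: the folio as a whole is not uptodate, so every
 * block is filled from disk -- clobbering the un-written-back write. */
static void read_whole_folio(struct folio_model *f)
{
	memcpy(f->data, disk, BLOCKS);
	for (int i = 0; i < BLOCKS; i++)
		f->uptodate[i] = 1;
}
```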
This is on top of two iomap patches [1] [2] applied locally on top of
commit f1c864be6e88 ("Merge branch 'vfs-6.18.async' into vfs.all") in
Christian's vfs.all tree.
This series was run through fstests on fuse passthrough_hp with an
out-of-tree patch enabling fuse large folios.
This patchset does not enable large folios on fuse yet. That will be part
of a different patchset.
Thanks,
Joanne
[1] https://lore.kernel.org/linux-fsdevel/20250919214250.4144807-1-joannelkoong@gmail.com/
[2] https://lore.kernel.org/linux-fsdevel/20250922180042.1775241-1-joannelkoong@gmail.com/
Changelog
---------
v4:
https://lore.kernel.org/linux-fsdevel/20250923002353.2961514-1-joannelkoong@gmail.com/
v4 -> v5:
* Add commit for tracking pending read bytes more optimally (patch 7), which
was suggested by Darrick and improves both the performance and the interface
* Merged "track read/readahead folio ownership internally" patch into patch 7
* Split iomap iter pos change into its own commit (Darrick) (patch 8)
v3:
https://lore.kernel.org/linux-fsdevel/20250916234425.1274735-1-joannelkoong@gmail.com/
v3 -> v4:
* Rebase this on top of patches [1] and [2]
* Fix readahead logic back to checking offset == 0 (patch 4)
* Bias needs to be before/after iomap_iter() (patch 10)
* Rename cur_folio_owned to folio_owned for read_folio (patch 7) (Darrick)
v2:
https://lore.kernel.org/linux-fsdevel/20250908185122.3199171-1-joannelkoong@gmail.com/
v2 -> v3:
* Incorporate Christoph's feedback
- Change naming to iomap_bio_* instead of iomap_xxx_bio
- Take his patch for moving bio logic into its own file (patch 11)
- Make ->read_folio_range interface not need pos arg (patch 9)
- Make ->submit_read return void (patch 9)
- Merge cur_folio_in_bio rename w/ tracking folio_owned internally (patch 7)
- Drop patch propagating error and replace with void return (patch 12)
- Make bias code easier to read (patch 10)
* Add WARN_ON_ONCE check in iteration refactoring (patch 4)
* Rename ->read_submit to ->submit_read (patch 9)
v1:
https://lore.kernel.org/linux-fsdevel/20250829235627.4053234-1-joannelkoong@gmail.com/
v1 -> v2:
* Don't pass in caller-provided arg through iter->private, pass it through
ctx->private instead (Darrick & Christoph)
* Separate 'bias' for ifs->read_bytes_pending into separate patch (Christoph)
* Rework read/readahead interface to take in struct iomap_read_folio_ctx
(Christoph)
* Add patch for removing fuse fc->blkbits workaround, now that Miklos's tree
has been merged into Christian's
Joanne Koong (14):
iomap: move bio read logic into helper function
iomap: move read/readahead bio submission logic into helper function
iomap: store read/readahead bio generically
iomap: iterate over folio mapping in iomap_readpage_iter()
iomap: rename iomap_readpage_iter() to iomap_read_folio_iter()
iomap: rename iomap_readpage_ctx struct to iomap_read_folio_ctx
iomap: track pending read bytes more optimally
iomap: set accurate iter->pos when reading folio ranges
iomap: add caller-provided callbacks for read and readahead
iomap: move buffered io bio logic into new file
iomap: make iomap_read_folio() a void return
fuse: use iomap for read_folio
fuse: use iomap for readahead
fuse: remove fc->blkbits workaround for partial writes
.../filesystems/iomap/operations.rst | 44 +++
block/fops.c | 5 +-
fs/erofs/data.c | 5 +-
fs/fuse/dir.c | 2 +-
fs/fuse/file.c | 288 +++++++++++-------
fs/fuse/fuse_i.h | 8 -
fs/fuse/inode.c | 13 +-
fs/gfs2/aops.c | 6 +-
fs/iomap/Makefile | 3 +-
fs/iomap/bio.c | 88 ++++++
fs/iomap/buffered-io.c | 246 +++++++--------
fs/iomap/internal.h | 12 +
fs/xfs/xfs_aops.c | 5 +-
fs/zonefs/file.c | 5 +-
include/linux/iomap.h | 63 +++-
15 files changed, 505 insertions(+), 288 deletions(-)
create mode 100644 fs/iomap/bio.c
--
2.47.3
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH v5 01/14] iomap: move bio read logic into helper function
2025-09-26 0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
@ 2025-09-26 0:25 ` Joanne Koong
2025-09-26 0:25 ` [PATCH v5 02/14] iomap: move read/readahead bio submission " Joanne Koong
` (13 subsequent siblings)
14 siblings, 0 replies; 50+ messages in thread
From: Joanne Koong @ 2025-09-26 0:25 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
Move the iomap_readpage_iter() bio read logic into a separate helper
function, iomap_bio_read_folio_range(). This is needed to make iomap
read/readahead more generically usable, especially for filesystems that
do not require CONFIG_BLOCK.
Additionally rename buffered write's iomap_read_folio_range() function
to iomap_bio_read_folio_range_sync() to better describe its synchronous
behavior.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 68 ++++++++++++++++++++++++------------------
1 file changed, 39 insertions(+), 29 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9535733ed07a..7e65075b6345 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -367,36 +367,15 @@ struct iomap_readpage_ctx {
struct readahead_control *rac;
};
-static int iomap_readpage_iter(struct iomap_iter *iter,
- struct iomap_readpage_ctx *ctx)
+static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
+ struct iomap_readpage_ctx *ctx, loff_t pos, size_t plen)
{
+ struct folio *folio = ctx->cur_folio;
const struct iomap *iomap = &iter->iomap;
- loff_t pos = iter->pos;
+ struct iomap_folio_state *ifs = folio->private;
+ size_t poff = offset_in_folio(folio, pos);
loff_t length = iomap_length(iter);
- struct folio *folio = ctx->cur_folio;
- struct iomap_folio_state *ifs;
- size_t poff, plen;
sector_t sector;
- int ret;
-
- if (iomap->type == IOMAP_INLINE) {
- ret = iomap_read_inline_data(iter, folio);
- if (ret)
- return ret;
- return iomap_iter_advance(iter, length);
- }
-
- /* zero post-eof blocks as the page may be mapped */
- ifs = ifs_alloc(iter->inode, folio, iter->flags);
- iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
- if (plen == 0)
- goto done;
-
- if (iomap_block_needs_zeroing(iter, pos)) {
- folio_zero_range(folio, poff, plen);
- iomap_set_range_uptodate(folio, poff, plen);
- goto done;
- }
ctx->cur_folio_in_bio = true;
if (ifs) {
@@ -435,6 +414,37 @@ static int iomap_readpage_iter(struct iomap_iter *iter,
ctx->bio->bi_end_io = iomap_read_end_io;
bio_add_folio_nofail(ctx->bio, folio, plen, poff);
}
+}
+
+static int iomap_readpage_iter(struct iomap_iter *iter,
+ struct iomap_readpage_ctx *ctx)
+{
+ const struct iomap *iomap = &iter->iomap;
+ loff_t pos = iter->pos;
+ loff_t length = iomap_length(iter);
+ struct folio *folio = ctx->cur_folio;
+ size_t poff, plen;
+ int ret;
+
+ if (iomap->type == IOMAP_INLINE) {
+ ret = iomap_read_inline_data(iter, folio);
+ if (ret)
+ return ret;
+ return iomap_iter_advance(iter, length);
+ }
+
+ /* zero post-eof blocks as the page may be mapped */
+ ifs_alloc(iter->inode, folio, iter->flags);
+ iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
+ if (plen == 0)
+ goto done;
+
+ if (iomap_block_needs_zeroing(iter, pos)) {
+ folio_zero_range(folio, poff, plen);
+ iomap_set_range_uptodate(folio, poff, plen);
+ } else {
+ iomap_bio_read_folio_range(iter, ctx, pos, plen);
+ }
done:
/*
@@ -559,7 +569,7 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
}
EXPORT_SYMBOL_GPL(iomap_readahead);
-static int iomap_read_folio_range(const struct iomap_iter *iter,
+static int iomap_bio_read_folio_range_sync(const struct iomap_iter *iter,
struct folio *folio, loff_t pos, size_t len)
{
const struct iomap *srcmap = iomap_iter_srcmap(iter);
@@ -572,7 +582,7 @@ static int iomap_read_folio_range(const struct iomap_iter *iter,
return submit_bio_wait(&bio);
}
#else
-static int iomap_read_folio_range(const struct iomap_iter *iter,
+static int iomap_bio_read_folio_range_sync(const struct iomap_iter *iter,
struct folio *folio, loff_t pos, size_t len)
{
WARN_ON_ONCE(1);
@@ -749,7 +759,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter,
status = write_ops->read_folio_range(iter,
folio, block_start, plen);
else
- status = iomap_read_folio_range(iter,
+ status = iomap_bio_read_folio_range_sync(iter,
folio, block_start, plen);
if (status)
return status;
--
2.47.3
* [PATCH v5 02/14] iomap: move read/readahead bio submission logic into helper function
2025-09-26 0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
2025-09-26 0:25 ` [PATCH v5 01/14] iomap: move bio read logic into helper function Joanne Koong
@ 2025-09-26 0:25 ` Joanne Koong
2025-09-26 0:25 ` [PATCH v5 03/14] iomap: store read/readahead bio generically Joanne Koong
` (12 subsequent siblings)
14 siblings, 0 replies; 50+ messages in thread
From: Joanne Koong @ 2025-09-26 0:25 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
Move the read/readahead bio submission logic into a separate helper.
This is needed to make iomap read/readahead more generically usable,
especially for filesystems that do not require CONFIG_BLOCK.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 30 ++++++++++++++++--------------
1 file changed, 16 insertions(+), 14 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 7e65075b6345..f8b985bb5a6b 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -367,6 +367,14 @@ struct iomap_readpage_ctx {
struct readahead_control *rac;
};
+static void iomap_bio_submit_read(struct iomap_readpage_ctx *ctx)
+{
+ struct bio *bio = ctx->bio;
+
+ if (bio)
+ submit_bio(bio);
+}
+
static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
struct iomap_readpage_ctx *ctx, loff_t pos, size_t plen)
{
@@ -392,8 +400,7 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
gfp_t orig_gfp = gfp;
unsigned int nr_vecs = DIV_ROUND_UP(length, PAGE_SIZE);
- if (ctx->bio)
- submit_bio(ctx->bio);
+ iomap_bio_submit_read(ctx);
if (ctx->rac) /* same as readahead_gfp_mask */
gfp |= __GFP_NORETRY | __GFP_NOWARN;
@@ -488,13 +495,10 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
while ((ret = iomap_iter(&iter, ops)) > 0)
iter.status = iomap_read_folio_iter(&iter, &ctx);
- if (ctx.bio) {
- submit_bio(ctx.bio);
- WARN_ON_ONCE(!ctx.cur_folio_in_bio);
- } else {
- WARN_ON_ONCE(ctx.cur_folio_in_bio);
+ iomap_bio_submit_read(&ctx);
+
+ if (!ctx.cur_folio_in_bio)
folio_unlock(folio);
- }
/*
* Just like mpage_readahead and block_read_full_folio, we always
@@ -560,12 +564,10 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
while (iomap_iter(&iter, ops) > 0)
iter.status = iomap_readahead_iter(&iter, &ctx);
- if (ctx.bio)
- submit_bio(ctx.bio);
- if (ctx.cur_folio) {
- if (!ctx.cur_folio_in_bio)
- folio_unlock(ctx.cur_folio);
- }
+ iomap_bio_submit_read(&ctx);
+
+ if (ctx.cur_folio && !ctx.cur_folio_in_bio)
+ folio_unlock(ctx.cur_folio);
}
EXPORT_SYMBOL_GPL(iomap_readahead);
--
2.47.3
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH v5 03/14] iomap: store read/readahead bio generically
2025-09-26 0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
2025-09-26 0:25 ` [PATCH v5 01/14] iomap: move bio read logic into helper function Joanne Koong
2025-09-26 0:25 ` [PATCH v5 02/14] iomap: move read/readahead bio submission " Joanne Koong
@ 2025-09-26 0:25 ` Joanne Koong
2025-09-26 0:25 ` [PATCH v5 04/14] iomap: iterate over folio mapping in iomap_readpage_iter() Joanne Koong
` (11 subsequent siblings)
14 siblings, 0 replies; 50+ messages in thread
From: Joanne Koong @ 2025-09-26 0:25 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
Store the iomap_readpage_ctx bio generically as a "void *read_ctx".
This makes the read/readahead interface more generic, which allows it to
be used by filesystems that may not be block-based and may not have
CONFIG_BLOCK set.
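The pattern can be sketched with a toy userspace model (the names
toy_bio, bio_backend_add, and bio_backend_submit are invented for
illustration; only the "opaque void pointer holds backend state" idea is
taken from the patch). The generic context never names a bio; the block
backend casts the opaque pointer in and out:

```c
#include <assert.h>
#include <stddef.h>

/* Toy sketch of the "void *read_ctx" pattern: the generic read
 * context stores backend state opaquely, so core code that is built
 * without CONFIG_BLOCK never has to reference a bio. */
struct read_folio_ctx {
	void *read_ctx;	/* a bio for block backends; anything else
			 * for non-block filesystems */
};

struct toy_bio {
	int nr_segs;
};

/* Block backend: lazily set up its private state on first use, then
 * accumulate segments into it (stands in for bio_alloc() + add). */
static void bio_backend_add(struct read_folio_ctx *ctx,
			    struct toy_bio *storage)
{
	struct toy_bio *bio = ctx->read_ctx;

	if (!bio) {
		storage->nr_segs = 0;
		bio = storage;
		ctx->read_ctx = bio;
	}
	bio->nr_segs++;
}

/* Submission helper only needs to check the opaque pointer; returns
 * the number of segments "submitted" (stands in for submit_bio()). */
static int bio_backend_submit(struct read_folio_ctx *ctx)
{
	struct toy_bio *bio = ctx->read_ctx;

	return bio ? bio->nr_segs : 0;
}
```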
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 29 ++++++++++++++---------------
1 file changed, 14 insertions(+), 15 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f8b985bb5a6b..b06b532033ad 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -363,13 +363,13 @@ static void iomap_read_end_io(struct bio *bio)
struct iomap_readpage_ctx {
struct folio *cur_folio;
bool cur_folio_in_bio;
- struct bio *bio;
+ void *read_ctx;
struct readahead_control *rac;
};
static void iomap_bio_submit_read(struct iomap_readpage_ctx *ctx)
{
- struct bio *bio = ctx->bio;
+ struct bio *bio = ctx->read_ctx;
if (bio)
submit_bio(bio);
@@ -384,6 +384,7 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
size_t poff = offset_in_folio(folio, pos);
loff_t length = iomap_length(iter);
sector_t sector;
+ struct bio *bio = ctx->read_ctx;
ctx->cur_folio_in_bio = true;
if (ifs) {
@@ -393,9 +394,8 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
}
sector = iomap_sector(iomap, pos);
- if (!ctx->bio ||
- bio_end_sector(ctx->bio) != sector ||
- !bio_add_folio(ctx->bio, folio, plen, poff)) {
+ if (!bio || bio_end_sector(bio) != sector ||
+ !bio_add_folio(bio, folio, plen, poff)) {
gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
gfp_t orig_gfp = gfp;
unsigned int nr_vecs = DIV_ROUND_UP(length, PAGE_SIZE);
@@ -404,22 +404,21 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
if (ctx->rac) /* same as readahead_gfp_mask */
gfp |= __GFP_NORETRY | __GFP_NOWARN;
- ctx->bio = bio_alloc(iomap->bdev, bio_max_segs(nr_vecs),
- REQ_OP_READ, gfp);
+ bio = bio_alloc(iomap->bdev, bio_max_segs(nr_vecs), REQ_OP_READ,
+ gfp);
/*
* If the bio_alloc fails, try it again for a single page to
* avoid having to deal with partial page reads. This emulates
* what do_mpage_read_folio does.
*/
- if (!ctx->bio) {
- ctx->bio = bio_alloc(iomap->bdev, 1, REQ_OP_READ,
- orig_gfp);
- }
+ if (!bio)
+ bio = bio_alloc(iomap->bdev, 1, REQ_OP_READ, orig_gfp);
if (ctx->rac)
- ctx->bio->bi_opf |= REQ_RAHEAD;
- ctx->bio->bi_iter.bi_sector = sector;
- ctx->bio->bi_end_io = iomap_read_end_io;
- bio_add_folio_nofail(ctx->bio, folio, plen, poff);
+ bio->bi_opf |= REQ_RAHEAD;
+ bio->bi_iter.bi_sector = sector;
+ bio->bi_end_io = iomap_read_end_io;
+ bio_add_folio_nofail(bio, folio, plen, poff);
+ ctx->read_ctx = bio;
}
}
--
2.47.3
* [PATCH v5 04/14] iomap: iterate over folio mapping in iomap_readpage_iter()
2025-09-26 0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
` (2 preceding siblings ...)
2025-09-26 0:25 ` [PATCH v5 03/14] iomap: store read/readahead bio generically Joanne Koong
@ 2025-09-26 0:25 ` Joanne Koong
2025-09-26 0:26 ` [PATCH v5 05/14] iomap: rename iomap_readpage_iter() to iomap_read_folio_iter() Joanne Koong
` (10 subsequent siblings)
14 siblings, 0 replies; 50+ messages in thread
From: Joanne Koong @ 2025-09-26 0:25 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc, Christoph Hellwig
Iterate over all non-uptodate ranges of a folio mapping in a single call
to iomap_readpage_iter() instead of leaving the partial iteration to the
caller.
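The single-call iteration can be modeled in userspace roughly as below
(a sketch only: the block size, the uptodate array, and
adjust_read_range() are invented stand-ins for the folio state and
iomap_adjust_read_range(); the loop shape mirrors the patch):

```c
#include <assert.h>

#define NBLK	8
#define BLKSZ	512

/* Stand-in for iomap_adjust_read_range(): skip the uptodate head of
 * the range, then return the next contiguous non-uptodate extent. */
static void adjust_read_range(const int *uptodate, unsigned int *pos,
			      unsigned int len, unsigned int *plen)
{
	unsigned int first = *pos / BLKSZ;
	unsigned int last = (*pos + len - 1) / BLKSZ;

	while (first <= last && uptodate[first]) {
		first++;
		*pos += BLKSZ;
	}
	*plen = 0;
	while (first <= last && !uptodate[first]) {
		first++;
		*plen += BLKSZ;
	}
}

/* One call iterates every non-uptodate range of the folio; returns
 * the total bytes that would be submitted for read. */
static unsigned int read_folio_ranges(const int *uptodate)
{
	unsigned int pos = 0, length = NBLK * BLKSZ;
	unsigned int submitted = 0, plen, count;

	while (length) {
		unsigned int start = pos;

		adjust_read_range(uptodate, &pos, length, &plen);
		if (plen == 0)
			break;		/* rest of folio is uptodate */
		submitted += plen;	/* "read in" [pos, pos + plen) */
		count = (pos - start) + plen;
		pos += plen;
		length -= count;
	}
	return submitted;
}
```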
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 53 ++++++++++++++++++++----------------------
1 file changed, 25 insertions(+), 28 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index b06b532033ad..dbe5783ee68c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -430,6 +430,7 @@ static int iomap_readpage_iter(struct iomap_iter *iter,
loff_t length = iomap_length(iter);
struct folio *folio = ctx->cur_folio;
size_t poff, plen;
+ loff_t count;
int ret;
if (iomap->type == IOMAP_INLINE) {
@@ -439,41 +440,35 @@ static int iomap_readpage_iter(struct iomap_iter *iter,
return iomap_iter_advance(iter, length);
}
- /* zero post-eof blocks as the page may be mapped */
ifs_alloc(iter->inode, folio, iter->flags);
- iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
- if (plen == 0)
- goto done;
- if (iomap_block_needs_zeroing(iter, pos)) {
- folio_zero_range(folio, poff, plen);
- iomap_set_range_uptodate(folio, poff, plen);
- } else {
- iomap_bio_read_folio_range(iter, ctx, pos, plen);
- }
+ length = min_t(loff_t, length,
+ folio_size(folio) - offset_in_folio(folio, pos));
+ while (length) {
+ iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff,
+ &plen);
-done:
- /*
- * Move the caller beyond our range so that it keeps making progress.
- * For that, we have to include any leading non-uptodate ranges, but
- * we can skip trailing ones as they will be handled in the next
- * iteration.
- */
- length = pos - iter->pos + plen;
- return iomap_iter_advance(iter, length);
-}
+ count = pos - iter->pos + plen;
+ if (WARN_ON_ONCE(count > length))
+ return -EIO;
-static int iomap_read_folio_iter(struct iomap_iter *iter,
- struct iomap_readpage_ctx *ctx)
-{
- int ret;
+ if (plen == 0)
+ return iomap_iter_advance(iter, count);
- while (iomap_length(iter)) {
- ret = iomap_readpage_iter(iter, ctx);
+ /* zero post-eof blocks as the page may be mapped */
+ if (iomap_block_needs_zeroing(iter, pos)) {
+ folio_zero_range(folio, poff, plen);
+ iomap_set_range_uptodate(folio, poff, plen);
+ } else {
+ iomap_bio_read_folio_range(iter, ctx, pos, plen);
+ }
+
+ ret = iomap_iter_advance(iter, count);
if (ret)
return ret;
+ length -= count;
+ pos = iter->pos;
}
-
return 0;
}
@@ -492,7 +487,7 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
trace_iomap_readpage(iter.inode, 1);
while ((ret = iomap_iter(&iter, ops)) > 0)
- iter.status = iomap_read_folio_iter(&iter, &ctx);
+ iter.status = iomap_readpage_iter(&iter, &ctx);
iomap_bio_submit_read(&ctx);
@@ -522,6 +517,8 @@ static int iomap_readahead_iter(struct iomap_iter *iter,
}
if (!ctx->cur_folio) {
ctx->cur_folio = readahead_folio(ctx->rac);
+ if (WARN_ON_ONCE(!ctx->cur_folio))
+ return -EINVAL;
ctx->cur_folio_in_bio = false;
}
ret = iomap_readpage_iter(iter, ctx);
--
2.47.3
* [PATCH v5 05/14] iomap: rename iomap_readpage_iter() to iomap_read_folio_iter()
2025-09-26 0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
` (3 preceding siblings ...)
2025-09-26 0:25 ` [PATCH v5 04/14] iomap: iterate over folio mapping in iomap_readpage_iter() Joanne Koong
@ 2025-09-26 0:26 ` Joanne Koong
2025-09-26 0:26 ` [PATCH v5 06/14] iomap: rename iomap_readpage_ctx struct to iomap_read_folio_ctx Joanne Koong
` (9 subsequent siblings)
14 siblings, 0 replies; 50+ messages in thread
From: Joanne Koong @ 2025-09-26 0:26 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc, Christoph Hellwig
->readpage was deprecated and reads are now on folios.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/iomap/buffered-io.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index dbe5783ee68c..23601373573e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -422,7 +422,7 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
}
}
-static int iomap_readpage_iter(struct iomap_iter *iter,
+static int iomap_read_folio_iter(struct iomap_iter *iter,
struct iomap_readpage_ctx *ctx)
{
const struct iomap *iomap = &iter->iomap;
@@ -487,7 +487,7 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
trace_iomap_readpage(iter.inode, 1);
while ((ret = iomap_iter(&iter, ops)) > 0)
- iter.status = iomap_readpage_iter(&iter, &ctx);
+ iter.status = iomap_read_folio_iter(&iter, &ctx);
iomap_bio_submit_read(&ctx);
@@ -521,7 +521,7 @@ static int iomap_readahead_iter(struct iomap_iter *iter,
return -EINVAL;
ctx->cur_folio_in_bio = false;
}
- ret = iomap_readpage_iter(iter, ctx);
+ ret = iomap_read_folio_iter(iter, ctx);
if (ret)
return ret;
}
--
2.47.3
* [PATCH v5 06/14] iomap: rename iomap_readpage_ctx struct to iomap_read_folio_ctx
2025-09-26 0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
` (4 preceding siblings ...)
2025-09-26 0:26 ` [PATCH v5 05/14] iomap: rename iomap_readpage_iter() to iomap_read_folio_iter() Joanne Koong
@ 2025-09-26 0:26 ` Joanne Koong
2025-09-26 0:26 ` [PATCH v5 07/14] iomap: track pending read bytes more optimally Joanne Koong
` (8 subsequent siblings)
14 siblings, 0 replies; 50+ messages in thread
From: Joanne Koong @ 2025-09-26 0:26 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc, Christoph Hellwig
->readpage was deprecated and reads are now on folios.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/iomap/buffered-io.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 23601373573e..09e65771a947 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -360,14 +360,14 @@ static void iomap_read_end_io(struct bio *bio)
bio_put(bio);
}
-struct iomap_readpage_ctx {
+struct iomap_read_folio_ctx {
struct folio *cur_folio;
bool cur_folio_in_bio;
void *read_ctx;
struct readahead_control *rac;
};
-static void iomap_bio_submit_read(struct iomap_readpage_ctx *ctx)
+static void iomap_bio_submit_read(struct iomap_read_folio_ctx *ctx)
{
struct bio *bio = ctx->read_ctx;
@@ -376,7 +376,7 @@ static void iomap_bio_submit_read(struct iomap_readpage_ctx *ctx)
}
static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
- struct iomap_readpage_ctx *ctx, loff_t pos, size_t plen)
+ struct iomap_read_folio_ctx *ctx, loff_t pos, size_t plen)
{
struct folio *folio = ctx->cur_folio;
const struct iomap *iomap = &iter->iomap;
@@ -423,7 +423,7 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
}
static int iomap_read_folio_iter(struct iomap_iter *iter,
- struct iomap_readpage_ctx *ctx)
+ struct iomap_read_folio_ctx *ctx)
{
const struct iomap *iomap = &iter->iomap;
loff_t pos = iter->pos;
@@ -479,7 +479,7 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
.pos = folio_pos(folio),
.len = folio_size(folio),
};
- struct iomap_readpage_ctx ctx = {
+ struct iomap_read_folio_ctx ctx = {
.cur_folio = folio,
};
int ret;
@@ -504,7 +504,7 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
EXPORT_SYMBOL_GPL(iomap_read_folio);
static int iomap_readahead_iter(struct iomap_iter *iter,
- struct iomap_readpage_ctx *ctx)
+ struct iomap_read_folio_ctx *ctx)
{
int ret;
@@ -551,7 +551,7 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
.pos = readahead_pos(rac),
.len = readahead_length(rac),
};
- struct iomap_readpage_ctx ctx = {
+ struct iomap_read_folio_ctx ctx = {
.rac = rac,
};
--
2.47.3
* [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-09-26 0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
` (5 preceding siblings ...)
2025-09-26 0:26 ` [PATCH v5 06/14] iomap: rename iomap_readpage_ctx struct to iomap_read_folio_ctx Joanne Koong
@ 2025-09-26 0:26 ` Joanne Koong
2025-10-23 19:34 ` Brian Foster
2025-09-26 0:26 ` [PATCH v5 08/14] iomap: set accurate iter->pos when reading folio ranges Joanne Koong
` (7 subsequent siblings)
14 siblings, 1 reply; 50+ messages in thread
From: Joanne Koong @ 2025-09-26 0:26 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
Instead of incrementing read_bytes_pending for every folio range read in
(which requires acquiring the spinlock to do so), set read_bytes_pending
to the folio size when the first range is asynchronously read in, keep
track of how many bytes total are asynchronously read in, and adjust
read_bytes_pending accordingly after issuing requests to read in all the
necessary ranges.
iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
value for pending bytes necessarily indicates the folio is in the bio.
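The accounting scheme can be sketched as a toy model (the struct and
function names below are illustrative, not the kernel's): charge the
whole folio size once up front, subtract whatever was never submitted
after all ranges are issued, and end the read when the counter drops to
zero from either path:

```c
#include <assert.h>

/* Toy model of the optimized read_bytes_pending accounting. */
struct ifs_model {
	unsigned int read_bytes_pending;
};

/* First async range for the folio: charge the whole folio size. */
static void read_init(struct ifs_model *ifs, unsigned int folio_size)
{
	ifs->read_bytes_pending += folio_size;
}

/* All ranges issued; bytes_pending is what was actually submitted.
 * Returns 1 if the caller must end the folio read itself, i.e. all
 * submitted IO has already completed. */
static int read_end(struct ifs_model *ifs, unsigned int folio_size,
		    unsigned int bytes_pending)
{
	ifs->read_bytes_pending -= folio_size - bytes_pending;
	return ifs->read_bytes_pending == 0;
}

/* IO completion path: one submitted range finished reading.
 * Returns 1 when this completion ends the folio read. */
static int read_range_done(struct ifs_model *ifs, unsigned int len)
{
	ifs->read_bytes_pending -= len;
	return ifs->read_bytes_pending == 0;
}
```

With this scheme the spinlocked increment per range goes away: only
read_init(), read_end(), and each completion touch the counter.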
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 87 ++++++++++++++++++++++++++++++++----------
1 file changed, 66 insertions(+), 21 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 09e65771a947..4e6258fdb915 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -362,7 +362,6 @@ static void iomap_read_end_io(struct bio *bio)
struct iomap_read_folio_ctx {
struct folio *cur_folio;
- bool cur_folio_in_bio;
void *read_ctx;
struct readahead_control *rac;
};
@@ -380,19 +379,11 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
{
struct folio *folio = ctx->cur_folio;
const struct iomap *iomap = &iter->iomap;
- struct iomap_folio_state *ifs = folio->private;
size_t poff = offset_in_folio(folio, pos);
loff_t length = iomap_length(iter);
sector_t sector;
struct bio *bio = ctx->read_ctx;
- ctx->cur_folio_in_bio = true;
- if (ifs) {
- spin_lock_irq(&ifs->state_lock);
- ifs->read_bytes_pending += plen;
- spin_unlock_irq(&ifs->state_lock);
- }
-
sector = iomap_sector(iomap, pos);
if (!bio || bio_end_sector(bio) != sector ||
!bio_add_folio(bio, folio, plen, poff)) {
@@ -422,8 +413,57 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
}
}
+static void iomap_read_init(struct folio *folio)
+{
+ struct iomap_folio_state *ifs = folio->private;
+
+ if (ifs) {
+ size_t len = folio_size(folio);
+
+ spin_lock_irq(&ifs->state_lock);
+ ifs->read_bytes_pending += len;
+ spin_unlock_irq(&ifs->state_lock);
+ }
+}
+
+static void iomap_read_end(struct folio *folio, size_t bytes_pending)
+{
+ struct iomap_folio_state *ifs;
+
+ /*
+ * If there are no bytes pending, this means we are responsible for
+ * unlocking the folio here, since no IO helper has taken ownership of
+ * it.
+ */
+ if (!bytes_pending) {
+ folio_unlock(folio);
+ return;
+ }
+
+ ifs = folio->private;
+ if (ifs) {
+ bool end_read, uptodate;
+ size_t bytes_accounted = folio_size(folio) - bytes_pending;
+
+ spin_lock_irq(&ifs->state_lock);
+ ifs->read_bytes_pending -= bytes_accounted;
+ /*
+ * If !ifs->read_bytes_pending, this means all pending reads
+ * by the IO helper have already completed, which means we need
+ * to end the folio read here. If ifs->read_bytes_pending != 0,
+ * the IO helper will end the folio read.
+ */
+ end_read = !ifs->read_bytes_pending;
+ if (end_read)
+ uptodate = ifs_is_fully_uptodate(folio, ifs);
+ spin_unlock_irq(&ifs->state_lock);
+ if (end_read)
+ folio_end_read(folio, uptodate);
+ }
+}
+
static int iomap_read_folio_iter(struct iomap_iter *iter,
- struct iomap_read_folio_ctx *ctx)
+ struct iomap_read_folio_ctx *ctx, size_t *bytes_pending)
{
const struct iomap *iomap = &iter->iomap;
loff_t pos = iter->pos;
@@ -460,6 +500,9 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
folio_zero_range(folio, poff, plen);
iomap_set_range_uptodate(folio, poff, plen);
} else {
+ if (!*bytes_pending)
+ iomap_read_init(folio);
+ *bytes_pending += plen;
iomap_bio_read_folio_range(iter, ctx, pos, plen);
}
@@ -482,17 +525,18 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
struct iomap_read_folio_ctx ctx = {
.cur_folio = folio,
};
+ size_t bytes_pending = 0;
int ret;
trace_iomap_readpage(iter.inode, 1);
while ((ret = iomap_iter(&iter, ops)) > 0)
- iter.status = iomap_read_folio_iter(&iter, &ctx);
+ iter.status = iomap_read_folio_iter(&iter, &ctx,
+ &bytes_pending);
iomap_bio_submit_read(&ctx);
- if (!ctx.cur_folio_in_bio)
- folio_unlock(folio);
+ iomap_read_end(folio, bytes_pending);
/*
* Just like mpage_readahead and block_read_full_folio, we always
@@ -504,24 +548,23 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
EXPORT_SYMBOL_GPL(iomap_read_folio);
static int iomap_readahead_iter(struct iomap_iter *iter,
- struct iomap_read_folio_ctx *ctx)
+ struct iomap_read_folio_ctx *ctx, size_t *cur_bytes_pending)
{
int ret;
while (iomap_length(iter)) {
if (ctx->cur_folio &&
offset_in_folio(ctx->cur_folio, iter->pos) == 0) {
- if (!ctx->cur_folio_in_bio)
- folio_unlock(ctx->cur_folio);
+ iomap_read_end(ctx->cur_folio, *cur_bytes_pending);
ctx->cur_folio = NULL;
}
if (!ctx->cur_folio) {
ctx->cur_folio = readahead_folio(ctx->rac);
if (WARN_ON_ONCE(!ctx->cur_folio))
return -EINVAL;
- ctx->cur_folio_in_bio = false;
+ *cur_bytes_pending = 0;
}
- ret = iomap_read_folio_iter(iter, ctx);
+ ret = iomap_read_folio_iter(iter, ctx, cur_bytes_pending);
if (ret)
return ret;
}
@@ -554,16 +597,18 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
struct iomap_read_folio_ctx ctx = {
.rac = rac,
};
+ size_t cur_bytes_pending;
trace_iomap_readahead(rac->mapping->host, readahead_count(rac));
while (iomap_iter(&iter, ops) > 0)
- iter.status = iomap_readahead_iter(&iter, &ctx);
+ iter.status = iomap_readahead_iter(&iter, &ctx,
+ &cur_bytes_pending);
iomap_bio_submit_read(&ctx);
- if (ctx.cur_folio && !ctx.cur_folio_in_bio)
- folio_unlock(ctx.cur_folio);
+ if (ctx.cur_folio)
+ iomap_read_end(ctx.cur_folio, cur_bytes_pending);
}
EXPORT_SYMBOL_GPL(iomap_readahead);
--
2.47.3
* [PATCH v5 08/14] iomap: set accurate iter->pos when reading folio ranges
From: Joanne Koong @ 2025-09-26 0:26 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc, Christoph Hellwig
Advance iter to the correct position before calling an IO helper to read
in a folio range. This allows the helper to reliably use iter->pos to
determine the starting offset for reading.
This will simplify the interface for reading in folio ranges when iomap
read/readahead supports caller-provided callbacks.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
---
fs/iomap/buffered-io.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4e6258fdb915..82bdf7c5e03c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -375,10 +375,11 @@ static void iomap_bio_submit_read(struct iomap_read_folio_ctx *ctx)
}
static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
- struct iomap_read_folio_ctx *ctx, loff_t pos, size_t plen)
+ struct iomap_read_folio_ctx *ctx, size_t plen)
{
struct folio *folio = ctx->cur_folio;
const struct iomap *iomap = &iter->iomap;
+ loff_t pos = iter->pos;
size_t poff = offset_in_folio(folio, pos);
loff_t length = iomap_length(iter);
sector_t sector;
@@ -470,7 +471,7 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
loff_t length = iomap_length(iter);
struct folio *folio = ctx->cur_folio;
size_t poff, plen;
- loff_t count;
+ loff_t pos_diff;
int ret;
if (iomap->type == IOMAP_INLINE) {
@@ -488,12 +489,16 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff,
&plen);
- count = pos - iter->pos + plen;
- if (WARN_ON_ONCE(count > length))
+ pos_diff = pos - iter->pos;
+ if (WARN_ON_ONCE(pos_diff + plen > length))
return -EIO;
+ ret = iomap_iter_advance(iter, pos_diff);
+ if (ret)
+ return ret;
+
if (plen == 0)
- return iomap_iter_advance(iter, count);
+ return 0;
/* zero post-eof blocks as the page may be mapped */
if (iomap_block_needs_zeroing(iter, pos)) {
@@ -503,13 +508,13 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
if (!*bytes_pending)
iomap_read_init(folio);
*bytes_pending += plen;
- iomap_bio_read_folio_range(iter, ctx, pos, plen);
+ iomap_bio_read_folio_range(iter, ctx, plen);
}
- ret = iomap_iter_advance(iter, count);
+ ret = iomap_iter_advance(iter, plen);
if (ret)
return ret;
- length -= count;
+ length -= pos_diff + plen;
pos = iter->pos;
}
return 0;
--
2.47.3
* [PATCH v5 09/14] iomap: add caller-provided callbacks for read and readahead
From: Joanne Koong @ 2025-09-26 0:26 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
Add caller-provided callbacks for read and readahead so that iomap can be
used generically, especially by filesystems that are not block-based.
In particular, this:
* Modifies the read and readahead interface to take in a
struct iomap_read_folio_ctx that is publicly defined as:
struct iomap_read_folio_ctx {
const struct iomap_read_ops *ops;
struct folio *cur_folio;
struct readahead_control *rac;
void *read_ctx;
};
where struct iomap_read_ops is defined as:
struct iomap_read_ops {
int (*read_folio_range)(const struct iomap_iter *iter,
struct iomap_read_folio_ctx *ctx,
size_t len);
void (*submit_read)(struct iomap_read_folio_ctx *ctx);
};
read_folio_range() reads in the folio range and must be provided by the
caller. submit_read() is optional and is used for submitting any
pending read requests.
* Modifies existing filesystems that use iomap for read and readahead to
use the new API, through the new static inline helpers
iomap_bio_read_folio() and iomap_bio_readahead(). There is no change
in functionality for those filesystems.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
.../filesystems/iomap/operations.rst | 44 +++++++++++++
block/fops.c | 5 +-
fs/erofs/data.c | 5 +-
fs/gfs2/aops.c | 6 +-
fs/iomap/buffered-io.c | 55 ++++++++--------
fs/xfs/xfs_aops.c | 5 +-
fs/zonefs/file.c | 5 +-
include/linux/iomap.h | 63 ++++++++++++++++++-
8 files changed, 149 insertions(+), 39 deletions(-)
diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
index 067ed8e14ef3..cef3c3e76e9e 100644
--- a/Documentation/filesystems/iomap/operations.rst
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -135,6 +135,28 @@ These ``struct kiocb`` flags are significant for buffered I/O with iomap:
* ``IOCB_DONTCACHE``: Turns on ``IOMAP_DONTCACHE``.
+``struct iomap_read_ops``
+--------------------------
+
+.. code-block:: c
+
+ struct iomap_read_ops {
+ int (*read_folio_range)(const struct iomap_iter *iter,
+ struct iomap_read_folio_ctx *ctx, size_t len);
+ void (*submit_read)(struct iomap_read_folio_ctx *ctx);
+ };
+
+iomap calls these functions:
+
+ - ``read_folio_range``: Called to read in the range. This must be provided
+ by the caller. The caller is responsible for calling
+ iomap_finish_folio_read() after reading in the folio range. This should be
+ done even if an error is encountered during the read. This returns 0 on
+ success or a negative error on failure.
+
+ - ``submit_read``: Submit any pending read requests. This function is
+ optional.
+
Internal per-Folio State
------------------------
@@ -182,6 +204,28 @@ The ``flags`` argument to ``->iomap_begin`` will be set to zero.
The pagecache takes whatever locks it needs before calling the
filesystem.
+Both ``iomap_readahead`` and ``iomap_read_folio`` pass in a ``struct
+iomap_read_folio_ctx``:
+
+.. code-block:: c
+
+ struct iomap_read_folio_ctx {
+ const struct iomap_read_ops *ops;
+ struct folio *cur_folio;
+ struct readahead_control *rac;
+ void *read_ctx;
+ };
+
+``iomap_readahead`` must set:
+ * ``ops->read_folio_range()`` and ``rac``
+
+``iomap_read_folio`` must set:
+ * ``ops->read_folio_range()`` and ``cur_folio``
+
+``ops->submit_read()`` and ``read_ctx`` are optional. ``read_ctx`` is used to
+pass in any custom data the caller needs accessible in the ops callbacks for
+fulfilling reads.
+
Buffered Writes
---------------
diff --git a/block/fops.c b/block/fops.c
index ddbc69c0922b..a2c2391d8dfa 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -533,12 +533,13 @@ const struct address_space_operations def_blk_aops = {
#else /* CONFIG_BUFFER_HEAD */
static int blkdev_read_folio(struct file *file, struct folio *folio)
{
- return iomap_read_folio(folio, &blkdev_iomap_ops);
+ iomap_bio_read_folio(folio, &blkdev_iomap_ops);
+ return 0;
}
static void blkdev_readahead(struct readahead_control *rac)
{
- iomap_readahead(rac, &blkdev_iomap_ops);
+ iomap_bio_readahead(rac, &blkdev_iomap_ops);
}
static ssize_t blkdev_writeback_range(struct iomap_writepage_ctx *wpc,
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 3b1ba571c728..be4191b33321 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -371,7 +371,8 @@ static int erofs_read_folio(struct file *file, struct folio *folio)
{
trace_erofs_read_folio(folio, true);
- return iomap_read_folio(folio, &erofs_iomap_ops);
+ iomap_bio_read_folio(folio, &erofs_iomap_ops);
+ return 0;
}
static void erofs_readahead(struct readahead_control *rac)
@@ -379,7 +380,7 @@ static void erofs_readahead(struct readahead_control *rac)
trace_erofs_readahead(rac->mapping->host, readahead_index(rac),
readahead_count(rac), true);
- return iomap_readahead(rac, &erofs_iomap_ops);
+ iomap_bio_readahead(rac, &erofs_iomap_ops);
}
static sector_t erofs_bmap(struct address_space *mapping, sector_t block)
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 47d74afd63ac..38d4f343187a 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -424,11 +424,11 @@ static int gfs2_read_folio(struct file *file, struct folio *folio)
struct inode *inode = folio->mapping->host;
struct gfs2_inode *ip = GFS2_I(inode);
struct gfs2_sbd *sdp = GFS2_SB(inode);
- int error;
+ int error = 0;
if (!gfs2_is_jdata(ip) ||
(i_blocksize(inode) == PAGE_SIZE && !folio_buffers(folio))) {
- error = iomap_read_folio(folio, &gfs2_iomap_ops);
+ iomap_bio_read_folio(folio, &gfs2_iomap_ops);
} else if (gfs2_is_stuffed(ip)) {
error = stuffed_read_folio(ip, folio);
} else {
@@ -503,7 +503,7 @@ static void gfs2_readahead(struct readahead_control *rac)
else if (gfs2_is_jdata(ip))
mpage_readahead(rac, gfs2_block_map);
else
- iomap_readahead(rac, &gfs2_iomap_ops);
+ iomap_bio_readahead(rac, &gfs2_iomap_ops);
}
/**
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 82bdf7c5e03c..9e1f1f0f8bf1 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -328,8 +328,8 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
}
#ifdef CONFIG_BLOCK
-static void iomap_finish_folio_read(struct folio *folio, size_t off,
- size_t len, int error)
+void iomap_finish_folio_read(struct folio *folio, size_t off, size_t len,
+ int error)
{
struct iomap_folio_state *ifs = folio->private;
bool uptodate = !error;
@@ -349,6 +349,7 @@ static void iomap_finish_folio_read(struct folio *folio, size_t off,
if (finished)
folio_end_read(folio, uptodate);
}
+EXPORT_SYMBOL_GPL(iomap_finish_folio_read);
static void iomap_read_end_io(struct bio *bio)
{
@@ -360,12 +361,6 @@ static void iomap_read_end_io(struct bio *bio)
bio_put(bio);
}
-struct iomap_read_folio_ctx {
- struct folio *cur_folio;
- void *read_ctx;
- struct readahead_control *rac;
-};
-
static void iomap_bio_submit_read(struct iomap_read_folio_ctx *ctx)
{
struct bio *bio = ctx->read_ctx;
@@ -374,7 +369,7 @@ static void iomap_bio_submit_read(struct iomap_read_folio_ctx *ctx)
submit_bio(bio);
}
-static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
+static int iomap_bio_read_folio_range(const struct iomap_iter *iter,
struct iomap_read_folio_ctx *ctx, size_t plen)
{
struct folio *folio = ctx->cur_folio;
@@ -412,8 +407,15 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
bio_add_folio_nofail(bio, folio, plen, poff);
ctx->read_ctx = bio;
}
+ return 0;
}
+const struct iomap_read_ops iomap_bio_read_ops = {
+ .read_folio_range = iomap_bio_read_folio_range,
+ .submit_read = iomap_bio_submit_read,
+};
+EXPORT_SYMBOL_GPL(iomap_bio_read_ops);
+
static void iomap_read_init(struct folio *folio)
{
struct iomap_folio_state *ifs = folio->private;
@@ -508,7 +510,9 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
if (!*bytes_pending)
iomap_read_init(folio);
*bytes_pending += plen;
- iomap_bio_read_folio_range(iter, ctx, plen);
+ ret = ctx->ops->read_folio_range(iter, ctx, plen);
+ if (ret)
+ return ret;
}
ret = iomap_iter_advance(iter, plen);
@@ -520,26 +524,25 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
return 0;
}
-int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
+int iomap_read_folio(const struct iomap_ops *ops,
+ struct iomap_read_folio_ctx *ctx)
{
+ struct folio *folio = ctx->cur_folio;
struct iomap_iter iter = {
.inode = folio->mapping->host,
.pos = folio_pos(folio),
.len = folio_size(folio),
};
- struct iomap_read_folio_ctx ctx = {
- .cur_folio = folio,
- };
size_t bytes_pending = 0;
int ret;
trace_iomap_readpage(iter.inode, 1);
while ((ret = iomap_iter(&iter, ops)) > 0)
- iter.status = iomap_read_folio_iter(&iter, &ctx,
- &bytes_pending);
+ iter.status = iomap_read_folio_iter(&iter, ctx, &bytes_pending);
- iomap_bio_submit_read(&ctx);
+ if (ctx->ops->submit_read)
+ ctx->ops->submit_read(ctx);
iomap_read_end(folio, bytes_pending);
@@ -579,8 +582,8 @@ static int iomap_readahead_iter(struct iomap_iter *iter,
/**
* iomap_readahead - Attempt to read pages from a file.
- * @rac: Describes the pages to be read.
* @ops: The operations vector for the filesystem.
+ * @ctx: The ctx used for issuing readahead.
*
* This function is for filesystems to call to implement their readahead
* address_space operation.
@@ -592,28 +595,28 @@ static int iomap_readahead_iter(struct iomap_iter *iter,
* function is called with memalloc_nofs set, so allocations will not cause
* the filesystem to be reentered.
*/
-void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
+void iomap_readahead(const struct iomap_ops *ops,
+ struct iomap_read_folio_ctx *ctx)
{
+ struct readahead_control *rac = ctx->rac;
struct iomap_iter iter = {
.inode = rac->mapping->host,
.pos = readahead_pos(rac),
.len = readahead_length(rac),
};
- struct iomap_read_folio_ctx ctx = {
- .rac = rac,
- };
size_t cur_bytes_pending;
trace_iomap_readahead(rac->mapping->host, readahead_count(rac));
while (iomap_iter(&iter, ops) > 0)
- iter.status = iomap_readahead_iter(&iter, &ctx,
+ iter.status = iomap_readahead_iter(&iter, ctx,
&cur_bytes_pending);
- iomap_bio_submit_read(&ctx);
+ if (ctx->ops->submit_read)
+ ctx->ops->submit_read(ctx);
- if (ctx.cur_folio)
- iomap_read_end(ctx.cur_folio, cur_bytes_pending);
+ if (ctx->cur_folio)
+ iomap_read_end(ctx->cur_folio, cur_bytes_pending);
}
EXPORT_SYMBOL_GPL(iomap_readahead);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index a26f79815533..0c2ed00733f2 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -742,14 +742,15 @@ xfs_vm_read_folio(
struct file *unused,
struct folio *folio)
{
- return iomap_read_folio(folio, &xfs_read_iomap_ops);
+ iomap_bio_read_folio(folio, &xfs_read_iomap_ops);
+ return 0;
}
STATIC void
xfs_vm_readahead(
struct readahead_control *rac)
{
- iomap_readahead(rac, &xfs_read_iomap_ops);
+ iomap_bio_readahead(rac, &xfs_read_iomap_ops);
}
static int
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index fd3a5922f6c3..4d6e7eb52966 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -112,12 +112,13 @@ static const struct iomap_ops zonefs_write_iomap_ops = {
static int zonefs_read_folio(struct file *unused, struct folio *folio)
{
- return iomap_read_folio(folio, &zonefs_read_iomap_ops);
+ iomap_bio_read_folio(folio, &zonefs_read_iomap_ops);
+ return 0;
}
static void zonefs_readahead(struct readahead_control *rac)
{
- iomap_readahead(rac, &zonefs_read_iomap_ops);
+ iomap_bio_readahead(rac, &zonefs_read_iomap_ops);
}
/*
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4469b2318b08..37435b912755 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -16,6 +16,7 @@ struct inode;
struct iomap_iter;
struct iomap_dio;
struct iomap_writepage_ctx;
+struct iomap_read_folio_ctx;
struct iov_iter;
struct kiocb;
struct page;
@@ -337,8 +338,10 @@ static inline bool iomap_want_unshare_iter(const struct iomap_iter *iter)
ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
const struct iomap_ops *ops,
const struct iomap_write_ops *write_ops, void *private);
-int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
-void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
+int iomap_read_folio(const struct iomap_ops *ops,
+ struct iomap_read_folio_ctx *ctx);
+void iomap_readahead(const struct iomap_ops *ops,
+ struct iomap_read_folio_ctx *ctx);
bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
@@ -465,6 +468,8 @@ ssize_t iomap_add_to_ioend(struct iomap_writepage_ctx *wpc, struct folio *folio,
loff_t pos, loff_t end_pos, unsigned int dirty_len);
int iomap_ioend_writeback_submit(struct iomap_writepage_ctx *wpc, int error);
+void iomap_finish_folio_read(struct folio *folio, size_t off, size_t len,
+ int error);
void iomap_start_folio_write(struct inode *inode, struct folio *folio,
size_t len);
void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
@@ -473,6 +478,34 @@ void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
int iomap_writeback_folio(struct iomap_writepage_ctx *wpc, struct folio *folio);
int iomap_writepages(struct iomap_writepage_ctx *wpc);
+struct iomap_read_folio_ctx {
+ const struct iomap_read_ops *ops;
+ struct folio *cur_folio;
+ struct readahead_control *rac;
+ void *read_ctx;
+};
+
+struct iomap_read_ops {
+ /*
+ * Read in a folio range.
+ *
+ * The caller is responsible for calling iomap_finish_folio_read() after
+ * reading in the folio range. This should be done even if an error is
+ * encountered during the read.
+ *
+ * Returns 0 on success or a negative error on failure.
+ */
+ int (*read_folio_range)(const struct iomap_iter *iter,
+ struct iomap_read_folio_ctx *ctx, size_t len);
+
+ /*
+ * Submit any pending read requests.
+ *
+ * This is optional.
+ */
+ void (*submit_read)(struct iomap_read_folio_ctx *ctx);
+};
+
/*
* Flags for direct I/O ->end_io:
*/
@@ -538,4 +571,30 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
extern struct bio_set iomap_ioend_bioset;
+#ifdef CONFIG_BLOCK
+extern const struct iomap_read_ops iomap_bio_read_ops;
+
+static inline void iomap_bio_read_folio(struct folio *folio,
+ const struct iomap_ops *ops)
+{
+ struct iomap_read_folio_ctx ctx = {
+ .ops = &iomap_bio_read_ops,
+ .cur_folio = folio,
+ };
+
+ iomap_read_folio(ops, &ctx);
+}
+
+static inline void iomap_bio_readahead(struct readahead_control *rac,
+ const struct iomap_ops *ops)
+{
+ struct iomap_read_folio_ctx ctx = {
+ .ops = &iomap_bio_read_ops,
+ .rac = rac,
+ };
+
+ iomap_readahead(ops, &ctx);
+}
+#endif /* CONFIG_BLOCK */
+
#endif /* LINUX_IOMAP_H */
--
2.47.3
* [PATCH v5 10/14] iomap: move buffered io bio logic into new file
From: Joanne Koong @ 2025-09-26 0:26 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
From: Christoph Hellwig <hch@lst.de> [1]
Move the bio logic in the buffered I/O code into its own file and remove
the CONFIG_BLOCK gating for iomap read/readahead.
[1] https://lore.kernel.org/linux-fsdevel/aMK2GuumUf93ep99@infradead.org/
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/iomap/Makefile | 3 +-
fs/iomap/bio.c | 88 ++++++++++++++++++++++++++++++++++++++++++
fs/iomap/buffered-io.c | 88 +-----------------------------------------
fs/iomap/internal.h | 12 ++++++
4 files changed, 103 insertions(+), 88 deletions(-)
create mode 100644 fs/iomap/bio.c
diff --git a/fs/iomap/Makefile b/fs/iomap/Makefile
index f7e1c8534c46..a572b8808524 100644
--- a/fs/iomap/Makefile
+++ b/fs/iomap/Makefile
@@ -14,5 +14,6 @@ iomap-y += trace.o \
iomap-$(CONFIG_BLOCK) += direct-io.o \
ioend.o \
fiemap.o \
- seek.o
+ seek.o \
+ bio.o
iomap-$(CONFIG_SWAP) += swapfile.o
diff --git a/fs/iomap/bio.c b/fs/iomap/bio.c
new file mode 100644
index 000000000000..fc045f2e4c45
--- /dev/null
+++ b/fs/iomap/bio.c
@@ -0,0 +1,88 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2010 Red Hat, Inc.
+ * Copyright (C) 2016-2023 Christoph Hellwig.
+ */
+#include <linux/iomap.h>
+#include <linux/pagemap.h>
+#include "internal.h"
+#include "trace.h"
+
+static void iomap_read_end_io(struct bio *bio)
+{
+ int error = blk_status_to_errno(bio->bi_status);
+ struct folio_iter fi;
+
+ bio_for_each_folio_all(fi, bio)
+ iomap_finish_folio_read(fi.folio, fi.offset, fi.length, error);
+ bio_put(bio);
+}
+
+static void iomap_bio_submit_read(struct iomap_read_folio_ctx *ctx)
+{
+ struct bio *bio = ctx->read_ctx;
+
+ if (bio)
+ submit_bio(bio);
+}
+
+static int iomap_bio_read_folio_range(const struct iomap_iter *iter,
+ struct iomap_read_folio_ctx *ctx, size_t plen)
+{
+ struct folio *folio = ctx->cur_folio;
+ const struct iomap *iomap = &iter->iomap;
+ loff_t pos = iter->pos;
+ size_t poff = offset_in_folio(folio, pos);
+ loff_t length = iomap_length(iter);
+ sector_t sector;
+ struct bio *bio = ctx->read_ctx;
+
+ sector = iomap_sector(iomap, pos);
+ if (!bio || bio_end_sector(bio) != sector ||
+ !bio_add_folio(bio, folio, plen, poff)) {
+ gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
+ gfp_t orig_gfp = gfp;
+ unsigned int nr_vecs = DIV_ROUND_UP(length, PAGE_SIZE);
+
+ if (bio)
+ submit_bio(bio);
+
+ if (ctx->rac) /* same as readahead_gfp_mask */
+ gfp |= __GFP_NORETRY | __GFP_NOWARN;
+ bio = bio_alloc(iomap->bdev, bio_max_segs(nr_vecs), REQ_OP_READ,
+ gfp);
+ /*
+ * If the bio_alloc fails, try it again for a single page to
+ * avoid having to deal with partial page reads. This emulates
+ * what do_mpage_read_folio does.
+ */
+ if (!bio)
+ bio = bio_alloc(iomap->bdev, 1, REQ_OP_READ, orig_gfp);
+ if (ctx->rac)
+ bio->bi_opf |= REQ_RAHEAD;
+ bio->bi_iter.bi_sector = sector;
+ bio->bi_end_io = iomap_read_end_io;
+ bio_add_folio_nofail(bio, folio, plen, poff);
+ ctx->read_ctx = bio;
+ }
+ return 0;
+}
+
+const struct iomap_read_ops iomap_bio_read_ops = {
+ .read_folio_range = iomap_bio_read_folio_range,
+ .submit_read = iomap_bio_submit_read,
+};
+EXPORT_SYMBOL_GPL(iomap_bio_read_ops);
+
+int iomap_bio_read_folio_range_sync(const struct iomap_iter *iter,
+ struct folio *folio, loff_t pos, size_t len)
+{
+ const struct iomap *srcmap = iomap_iter_srcmap(iter);
+ struct bio_vec bvec;
+ struct bio bio;
+
+ bio_init(&bio, srcmap->bdev, &bvec, 1, REQ_OP_READ);
+ bio.bi_iter.bi_sector = iomap_sector(srcmap, pos);
+ bio_add_folio_nofail(&bio, folio, len, offset_in_folio(folio, pos));
+ return submit_bio_wait(&bio);
+}
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9e1f1f0f8bf1..86c8094e5cc8 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -8,6 +8,7 @@
#include <linux/writeback.h>
#include <linux/swap.h>
#include <linux/migrate.h>
+#include "internal.h"
#include "trace.h"
#include "../internal.h"
@@ -327,7 +328,6 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
return 0;
}
-#ifdef CONFIG_BLOCK
void iomap_finish_folio_read(struct folio *folio, size_t off, size_t len,
int error)
{
@@ -351,71 +351,6 @@ void iomap_finish_folio_read(struct folio *folio, size_t off, size_t len,
}
EXPORT_SYMBOL_GPL(iomap_finish_folio_read);
-static void iomap_read_end_io(struct bio *bio)
-{
- int error = blk_status_to_errno(bio->bi_status);
- struct folio_iter fi;
-
- bio_for_each_folio_all(fi, bio)
- iomap_finish_folio_read(fi.folio, fi.offset, fi.length, error);
- bio_put(bio);
-}
-
-static void iomap_bio_submit_read(struct iomap_read_folio_ctx *ctx)
-{
- struct bio *bio = ctx->read_ctx;
-
- if (bio)
- submit_bio(bio);
-}
-
-static int iomap_bio_read_folio_range(const struct iomap_iter *iter,
- struct iomap_read_folio_ctx *ctx, size_t plen)
-{
- struct folio *folio = ctx->cur_folio;
- const struct iomap *iomap = &iter->iomap;
- loff_t pos = iter->pos;
- size_t poff = offset_in_folio(folio, pos);
- loff_t length = iomap_length(iter);
- sector_t sector;
- struct bio *bio = ctx->read_ctx;
-
- sector = iomap_sector(iomap, pos);
- if (!bio || bio_end_sector(bio) != sector ||
- !bio_add_folio(bio, folio, plen, poff)) {
- gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
- gfp_t orig_gfp = gfp;
- unsigned int nr_vecs = DIV_ROUND_UP(length, PAGE_SIZE);
-
- iomap_bio_submit_read(ctx);
-
- if (ctx->rac) /* same as readahead_gfp_mask */
- gfp |= __GFP_NORETRY | __GFP_NOWARN;
- bio = bio_alloc(iomap->bdev, bio_max_segs(nr_vecs), REQ_OP_READ,
- gfp);
- /*
- * If the bio_alloc fails, try it again for a single page to
- * avoid having to deal with partial page reads. This emulates
- * what do_mpage_read_folio does.
- */
- if (!bio)
- bio = bio_alloc(iomap->bdev, 1, REQ_OP_READ, orig_gfp);
- if (ctx->rac)
- bio->bi_opf |= REQ_RAHEAD;
- bio->bi_iter.bi_sector = sector;
- bio->bi_end_io = iomap_read_end_io;
- bio_add_folio_nofail(bio, folio, plen, poff);
- ctx->read_ctx = bio;
- }
- return 0;
-}
-
-const struct iomap_read_ops iomap_bio_read_ops = {
- .read_folio_range = iomap_bio_read_folio_range,
- .submit_read = iomap_bio_submit_read,
-};
-EXPORT_SYMBOL_GPL(iomap_bio_read_ops);
-
static void iomap_read_init(struct folio *folio)
{
struct iomap_folio_state *ifs = folio->private;
@@ -620,27 +555,6 @@ void iomap_readahead(const struct iomap_ops *ops,
}
EXPORT_SYMBOL_GPL(iomap_readahead);
-static int iomap_bio_read_folio_range_sync(const struct iomap_iter *iter,
- struct folio *folio, loff_t pos, size_t len)
-{
- const struct iomap *srcmap = iomap_iter_srcmap(iter);
- struct bio_vec bvec;
- struct bio bio;
-
- bio_init(&bio, srcmap->bdev, &bvec, 1, REQ_OP_READ);
- bio.bi_iter.bi_sector = iomap_sector(srcmap, pos);
- bio_add_folio_nofail(&bio, folio, len, offset_in_folio(folio, pos));
- return submit_bio_wait(&bio);
-}
-#else
-static int iomap_bio_read_folio_range_sync(const struct iomap_iter *iter,
- struct folio *folio, loff_t pos, size_t len)
-{
- WARN_ON_ONCE(1);
- return -EIO;
-}
-#endif /* CONFIG_BLOCK */
-
/*
* iomap_is_partially_uptodate checks whether blocks within a folio are
* uptodate or not.
diff --git a/fs/iomap/internal.h b/fs/iomap/internal.h
index d05cb3aed96e..3a4e4aad2bd1 100644
--- a/fs/iomap/internal.h
+++ b/fs/iomap/internal.h
@@ -6,4 +6,16 @@
u32 iomap_finish_ioend_direct(struct iomap_ioend *ioend);
+#ifdef CONFIG_BLOCK
+int iomap_bio_read_folio_range_sync(const struct iomap_iter *iter,
+ struct folio *folio, loff_t pos, size_t len);
+#else
+static inline int iomap_bio_read_folio_range_sync(const struct iomap_iter *iter,
+ struct folio *folio, loff_t pos, size_t len)
+{
+ WARN_ON_ONCE(1);
+ return -EIO;
+}
+#endif /* CONFIG_BLOCK */
+
#endif /* _IOMAP_INTERNAL_H */
--
2.47.3
* [PATCH v5 11/14] iomap: make iomap_read_folio() a void return
From: Joanne Koong @ 2025-09-26 0:26 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
No errors are propagated in iomap_read_folio(). Change
iomap_read_folio() to a void return to make this clearer to callers.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 9 +--------
include/linux/iomap.h | 2 +-
2 files changed, 2 insertions(+), 9 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 86c8094e5cc8..f9ae72713f74 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -459,7 +459,7 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
return 0;
}
-int iomap_read_folio(const struct iomap_ops *ops,
+void iomap_read_folio(const struct iomap_ops *ops,
struct iomap_read_folio_ctx *ctx)
{
struct folio *folio = ctx->cur_folio;
@@ -480,13 +480,6 @@ int iomap_read_folio(const struct iomap_ops *ops,
ctx->ops->submit_read(ctx);
iomap_read_end(folio, bytes_pending);
-
- /*
- * Just like mpage_readahead and block_read_full_folio, we always
- * return 0 and just set the folio error flag on errors. This
- * should be cleaned up throughout the stack eventually.
- */
- return 0;
}
EXPORT_SYMBOL_GPL(iomap_read_folio);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 37435b912755..6d864b446b6e 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -338,7 +338,7 @@ static inline bool iomap_want_unshare_iter(const struct iomap_iter *iter)
ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
const struct iomap_ops *ops,
const struct iomap_write_ops *write_ops, void *private);
-int iomap_read_folio(const struct iomap_ops *ops,
+void iomap_read_folio(const struct iomap_ops *ops,
struct iomap_read_folio_ctx *ctx);
void iomap_readahead(const struct iomap_ops *ops,
struct iomap_read_folio_ctx *ctx);
--
2.47.3
* [PATCH v5 12/14] fuse: use iomap for read_folio
From: Joanne Koong @ 2025-09-26 0:26 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
Read folio data into the page cache using iomap. This gives us granular
uptodate tracking for large folios, which optimizes how much data needs
to be read in. If some portions of the folio are already uptodate (e.g.
through a prior write), we only need to read in the non-uptodate
portions.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/file.c | 80 +++++++++++++++++++++++++++++++++++---------------
1 file changed, 56 insertions(+), 24 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 4adcf09d4b01..db93c83ee4a3 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -828,23 +828,69 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio,
return 0;
}
+static int fuse_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
+ unsigned int flags, struct iomap *iomap,
+ struct iomap *srcmap)
+{
+ iomap->type = IOMAP_MAPPED;
+ iomap->length = length;
+ iomap->offset = offset;
+ return 0;
+}
+
+static const struct iomap_ops fuse_iomap_ops = {
+ .iomap_begin = fuse_iomap_begin,
+};
+
+struct fuse_fill_read_data {
+ struct file *file;
+};
+
+static int fuse_iomap_read_folio_range_async(const struct iomap_iter *iter,
+ struct iomap_read_folio_ctx *ctx,
+ size_t len)
+{
+ struct fuse_fill_read_data *data = ctx->read_ctx;
+ struct folio *folio = ctx->cur_folio;
+ loff_t pos = iter->pos;
+ size_t off = offset_in_folio(folio, pos);
+ struct file *file = data->file;
+ int ret;
+
+ /*
+ * for non-readahead read requests, do reads synchronously since
+ * it's not guaranteed that the server can handle out-of-order reads
+ */
+ ret = fuse_do_readfolio(file, folio, off, len);
+ iomap_finish_folio_read(folio, off, len, ret);
+ return ret;
+}
+
+static const struct iomap_read_ops fuse_iomap_read_ops = {
+ .read_folio_range = fuse_iomap_read_folio_range_async,
+};
+
static int fuse_read_folio(struct file *file, struct folio *folio)
{
struct inode *inode = folio->mapping->host;
- int err;
+ struct fuse_fill_read_data data = {
+ .file = file,
+ };
+ struct iomap_read_folio_ctx ctx = {
+ .cur_folio = folio,
+ .ops = &fuse_iomap_read_ops,
+ .read_ctx = &data,
- err = -EIO;
- if (fuse_is_bad(inode))
- goto out;
+ };
- err = fuse_do_readfolio(file, folio, 0, folio_size(folio));
- if (!err)
- folio_mark_uptodate(folio);
+ if (fuse_is_bad(inode)) {
+ folio_unlock(folio);
+ return -EIO;
+ }
+ iomap_read_folio(&fuse_iomap_ops, &ctx);
fuse_invalidate_atime(inode);
- out:
- folio_unlock(folio);
- return err;
+ return 0;
}
static int fuse_iomap_read_folio_range(const struct iomap_iter *iter,
@@ -1394,20 +1440,6 @@ static const struct iomap_write_ops fuse_iomap_write_ops = {
.read_folio_range = fuse_iomap_read_folio_range,
};
-static int fuse_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
- unsigned int flags, struct iomap *iomap,
- struct iomap *srcmap)
-{
- iomap->type = IOMAP_MAPPED;
- iomap->length = length;
- iomap->offset = offset;
- return 0;
-}
-
-static const struct iomap_ops fuse_iomap_ops = {
- .iomap_begin = fuse_iomap_begin,
-};
-
static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct file *file = iocb->ki_filp;
--
2.47.3
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH v5 13/14] fuse: use iomap for readahead
2025-09-26 0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
` (11 preceding siblings ...)
2025-09-26 0:26 ` [PATCH v5 12/14] fuse: use iomap for read_folio Joanne Koong
@ 2025-09-26 0:26 ` Joanne Koong
2025-09-26 0:26 ` [PATCH v5 14/14] fuse: remove fc->blkbits workaround for partial writes Joanne Koong
2025-09-29 9:38 ` [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Christian Brauner
14 siblings, 0 replies; 50+ messages in thread
From: Joanne Koong @ 2025-09-26 0:26 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
Do readahead in fuse using iomap. This gives us granular uptodate
tracking for large folios, which optimizes how much data needs to be
read in. If some portions of the folio are already uptodate (e.g. through
a prior write), we only need to read in the non-uptodate portions.
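As a rough userspace illustration of what granular uptodate tracking buys here (the names and sizes below are illustrative stand-ins, not the actual iomap_folio_state API), a per-block uptodate bitmap lets a read fetch only the blocks a prior write has not already made uptodate:

```c
#include <stdbool.h>

#define FOLIO_SIZE 4096u
#define BLOCK_SIZE 1024u
#define NBLOCKS (FOLIO_SIZE / BLOCK_SIZE)

/* Toy per-folio state: one uptodate bit per block, standing in for the
 * real iomap_folio_state bitmap. */
struct toy_folio_state {
	bool uptodate[NBLOCKS];
};

/* Mark [off, off + len) uptodate, e.g. after a buffered write. */
static void toy_set_range_uptodate(struct toy_folio_state *fs,
				   unsigned int off, unsigned int len)
{
	unsigned int b;

	for (b = off / BLOCK_SIZE;
	     b < (off + len + BLOCK_SIZE - 1) / BLOCK_SIZE; b++)
		fs->uptodate[b] = true;
}

/* A read only needs to fetch the blocks that are not yet uptodate. */
static unsigned int toy_bytes_to_read(const struct toy_folio_state *fs)
{
	unsigned int b, n = 0;

	for (b = 0; b < NBLOCKS; b++)
		if (!fs->uptodate[b])
			n += BLOCK_SIZE;
	return n;
}
```

Without per-block tracking, a read_folio after a partial write would have to fetch all FOLIO_SIZE bytes; with it, only the remaining non-uptodate blocks are read in.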
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/file.c | 220 ++++++++++++++++++++++++++++---------------------
1 file changed, 124 insertions(+), 96 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index db93c83ee4a3..7c9c00784e33 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -844,8 +844,65 @@ static const struct iomap_ops fuse_iomap_ops = {
struct fuse_fill_read_data {
struct file *file;
+
+ /* Fields below are used if sending the read request asynchronously */
+ struct fuse_conn *fc;
+ struct fuse_io_args *ia;
+ unsigned int nr_bytes;
};
+/* forward declarations */
+static bool fuse_folios_need_send(struct fuse_conn *fc, loff_t pos,
+ unsigned len, struct fuse_args_pages *ap,
+ unsigned cur_bytes, bool write);
+static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file,
+ unsigned int count, bool async);
+
+static int fuse_handle_readahead(struct folio *folio,
+ struct readahead_control *rac,
+ struct fuse_fill_read_data *data, loff_t pos,
+ size_t len)
+{
+ struct fuse_io_args *ia = data->ia;
+ size_t off = offset_in_folio(folio, pos);
+ struct fuse_conn *fc = data->fc;
+ struct fuse_args_pages *ap;
+ unsigned int nr_pages;
+
+ if (ia && fuse_folios_need_send(fc, pos, len, &ia->ap, data->nr_bytes,
+ false)) {
+ fuse_send_readpages(ia, data->file, data->nr_bytes,
+ fc->async_read);
+ data->nr_bytes = 0;
+ data->ia = NULL;
+ ia = NULL;
+ }
+ if (!ia) {
+ if (fc->num_background >= fc->congestion_threshold &&
+ rac->ra->async_size >= readahead_count(rac))
+ /*
+ * Congested and only async pages left, so skip the
+ * rest.
+ */
+ return -EAGAIN;
+
+ nr_pages = min(fc->max_pages, readahead_count(rac));
+ data->ia = fuse_io_alloc(NULL, nr_pages);
+ if (!data->ia)
+ return -ENOMEM;
+ ia = data->ia;
+ }
+ folio_get(folio);
+ ap = &ia->ap;
+ ap->folios[ap->num_folios] = folio;
+ ap->descs[ap->num_folios].offset = off;
+ ap->descs[ap->num_folios].length = len;
+ data->nr_bytes += len;
+ ap->num_folios++;
+
+ return 0;
+}
+
static int fuse_iomap_read_folio_range_async(const struct iomap_iter *iter,
struct iomap_read_folio_ctx *ctx,
size_t len)
@@ -857,17 +914,39 @@ static int fuse_iomap_read_folio_range_async(const struct iomap_iter *iter,
struct file *file = data->file;
int ret;
- /*
- * for non-readahead read requests, do reads synchronously since
- * it's not guaranteed that the server can handle out-of-order reads
- */
- ret = fuse_do_readfolio(file, folio, off, len);
- iomap_finish_folio_read(folio, off, len, ret);
+ if (ctx->rac) {
+ ret = fuse_handle_readahead(folio, ctx->rac, data, pos, len);
+ /*
+ * If fuse_handle_readahead was successful, fuse_readpages_end
+ * will do the iomap_finish_folio_read, else we need to call it
+ * here
+ */
+ if (ret)
+ iomap_finish_folio_read(folio, off, len, ret);
+ } else {
+ /*
+ * for non-readahead read requests, do reads synchronously
+ * since it's not guaranteed that the server can handle
+ * out-of-order reads
+ */
+ ret = fuse_do_readfolio(file, folio, off, len);
+ iomap_finish_folio_read(folio, off, len, ret);
+ }
return ret;
}
+static void fuse_iomap_read_submit(struct iomap_read_folio_ctx *ctx)
+{
+ struct fuse_fill_read_data *data = ctx->read_ctx;
+
+ if (data->ia)
+ fuse_send_readpages(data->ia, data->file, data->nr_bytes,
+ data->fc->async_read);
+}
+
static const struct iomap_read_ops fuse_iomap_read_ops = {
.read_folio_range = fuse_iomap_read_folio_range_async,
+ .submit_read = fuse_iomap_read_submit,
};
static int fuse_read_folio(struct file *file, struct folio *folio)
@@ -929,7 +1008,8 @@ static void fuse_readpages_end(struct fuse_mount *fm, struct fuse_args *args,
}
for (i = 0; i < ap->num_folios; i++) {
- folio_end_read(ap->folios[i], !err);
+ iomap_finish_folio_read(ap->folios[i], ap->descs[i].offset,
+ ap->descs[i].length, err);
folio_put(ap->folios[i]);
}
if (ia->ff)
@@ -939,7 +1019,7 @@ static void fuse_readpages_end(struct fuse_mount *fm, struct fuse_args *args,
}
static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file,
- unsigned int count)
+ unsigned int count, bool async)
{
struct fuse_file *ff = file->private_data;
struct fuse_mount *fm = ff->fm;
@@ -961,7 +1041,7 @@ static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file,
fuse_read_args_fill(ia, file, pos, count, FUSE_READ);
ia->read.attr_ver = fuse_get_attr_version(fm->fc);
- if (fm->fc->async_read) {
+ if (async) {
ia->ff = fuse_file_get(ff);
ap->args.end = fuse_readpages_end;
err = fuse_simple_background(fm, &ap->args, GFP_KERNEL);
@@ -978,81 +1058,20 @@ static void fuse_readahead(struct readahead_control *rac)
{
struct inode *inode = rac->mapping->host;
struct fuse_conn *fc = get_fuse_conn(inode);
- unsigned int max_pages, nr_pages;
- struct folio *folio = NULL;
+ struct fuse_fill_read_data data = {
+ .file = rac->file,
+ .fc = fc,
+ };
+ struct iomap_read_folio_ctx ctx = {
+ .ops = &fuse_iomap_read_ops,
+ .rac = rac,
+ .read_ctx = &data
+ };
if (fuse_is_bad(inode))
return;
- max_pages = min_t(unsigned int, fc->max_pages,
- fc->max_read / PAGE_SIZE);
-
- /*
- * This is only accurate the first time through, since readahead_folio()
- * doesn't update readahead_count() from the previous folio until the
- * next call. Grab nr_pages here so we know how many pages we're going
- * to have to process. This means that we will exit here with
- * readahead_count() == folio_nr_pages(last_folio), but we will have
- * consumed all of the folios, and read_pages() will call
- * readahead_folio() again which will clean up the rac.
- */
- nr_pages = readahead_count(rac);
-
- while (nr_pages) {
- struct fuse_io_args *ia;
- struct fuse_args_pages *ap;
- unsigned cur_pages = min(max_pages, nr_pages);
- unsigned int pages = 0;
-
- if (fc->num_background >= fc->congestion_threshold &&
- rac->ra->async_size >= readahead_count(rac))
- /*
- * Congested and only async pages left, so skip the
- * rest.
- */
- break;
-
- ia = fuse_io_alloc(NULL, cur_pages);
- if (!ia)
- break;
- ap = &ia->ap;
-
- while (pages < cur_pages) {
- unsigned int folio_pages;
-
- /*
- * This returns a folio with a ref held on it.
- * The ref needs to be held until the request is
- * completed, since the splice case (see
- * fuse_try_move_page()) drops the ref after it's
- * replaced in the page cache.
- */
- if (!folio)
- folio = __readahead_folio(rac);
-
- folio_pages = folio_nr_pages(folio);
- if (folio_pages > cur_pages - pages) {
- /*
- * Large folios belonging to fuse will never
- * have more pages than max_pages.
- */
- WARN_ON(!pages);
- break;
- }
-
- ap->folios[ap->num_folios] = folio;
- ap->descs[ap->num_folios].length = folio_size(folio);
- ap->num_folios++;
- pages += folio_pages;
- folio = NULL;
- }
- fuse_send_readpages(ia, rac->file, pages << PAGE_SHIFT);
- nr_pages -= pages;
- }
- if (folio) {
- folio_end_read(folio, false);
- folio_put(folio);
- }
+ iomap_readahead(&fuse_iomap_ops, &ctx);
}
static ssize_t fuse_cache_read_iter(struct kiocb *iocb, struct iov_iter *to)
@@ -2083,7 +2102,7 @@ struct fuse_fill_wb_data {
struct fuse_file *ff;
unsigned int max_folios;
/*
- * nr_bytes won't overflow since fuse_writepage_need_send() caps
+ * nr_bytes won't overflow since fuse_folios_need_send() caps
* wb requests to never exceed fc->max_pages (which has an upper bound
* of U16_MAX).
*/
@@ -2128,14 +2147,15 @@ static void fuse_writepages_send(struct inode *inode,
spin_unlock(&fi->lock);
}
-static bool fuse_writepage_need_send(struct fuse_conn *fc, loff_t pos,
- unsigned len, struct fuse_args_pages *ap,
- struct fuse_fill_wb_data *data)
+static bool fuse_folios_need_send(struct fuse_conn *fc, loff_t pos,
+ unsigned len, struct fuse_args_pages *ap,
+ unsigned cur_bytes, bool write)
{
struct folio *prev_folio;
struct fuse_folio_desc prev_desc;
- unsigned bytes = data->nr_bytes + len;
+ unsigned bytes = cur_bytes + len;
loff_t prev_pos;
+ size_t max_bytes = write ? fc->max_write : fc->max_read;
WARN_ON(!ap->num_folios);
@@ -2143,8 +2163,7 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, loff_t pos,
if ((bytes + PAGE_SIZE - 1) >> PAGE_SHIFT > fc->max_pages)
return true;
- /* Reached max write bytes */
- if (bytes > fc->max_write)
+ if (bytes > max_bytes)
return true;
/* Discontinuity */
@@ -2154,11 +2173,6 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, loff_t pos,
if (prev_pos != pos)
return true;
- /* Need to grow the pages array? If so, did the expansion fail? */
- if (ap->num_folios == data->max_folios &&
- !fuse_pages_realloc(data, fc->max_pages))
- return true;
-
return false;
}
@@ -2182,10 +2196,24 @@ static ssize_t fuse_iomap_writeback_range(struct iomap_writepage_ctx *wpc,
return -EIO;
}
- if (wpa && fuse_writepage_need_send(fc, pos, len, ap, data)) {
- fuse_writepages_send(inode, data);
- data->wpa = NULL;
- data->nr_bytes = 0;
+ if (wpa) {
+ bool send = fuse_folios_need_send(fc, pos, len, ap,
+ data->nr_bytes, true);
+
+ if (!send) {
+ /*
+ * Need to grow the pages array? If so, did the
+ * expansion fail?
+ */
+ send = (ap->num_folios == data->max_folios) &&
+ !fuse_pages_realloc(data, fc->max_pages);
+ }
+
+ if (send) {
+ fuse_writepages_send(inode, data);
+ data->wpa = NULL;
+ data->nr_bytes = 0;
+ }
}
if (data->wpa == NULL) {
--
2.47.3
* [PATCH v5 14/14] fuse: remove fc->blkbits workaround for partial writes
2025-09-26 0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
` (12 preceding siblings ...)
2025-09-26 0:26 ` [PATCH v5 13/14] fuse: use iomap for readahead Joanne Koong
@ 2025-09-26 0:26 ` Joanne Koong
2025-09-29 9:38 ` [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Christian Brauner
14 siblings, 0 replies; 50+ messages in thread
From: Joanne Koong @ 2025-09-26 0:26 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
Now that fuse is integrated with iomap for read/readahead, we can remove
the workaround added in commit bd24d2108e9c ("fuse: fix fuseblk
i_blkbits for iomap partial writes"), which avoided a race condition
where an iomap partial write could be overwritten by a read when
blocksize < PAGE_SIZE. With iomap read/readahead, granular uptodate
tracking of blocks protects against this race, so the workaround is no
longer needed.
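The race the workaround papered over can be modeled in a few lines of userspace C (pure illustration; none of these names exist in the kernel): with blocksize < PAGE_SIZE and no per-block uptodate tracking, a read after a partial write pulls the whole folio in from disk and clobbers the not-yet-written-back data, while a granular read leaves the written block alone:

```c
#include <string.h>

#define PAGE 4096
#define BLK  1024

/* Read into 'folio' from 'disk'. If 'granular' is set, skip blocks the
 * caller marked uptodate (the partial write); otherwise read the whole
 * folio, as fuse did before using iomap for reads. */
static void toy_read_folio(char *folio, const char *disk,
			   const _Bool *block_uptodate, _Bool granular)
{
	int b;

	for (b = 0; b < PAGE / BLK; b++) {
		if (granular && block_uptodate[b])
			continue;
		memcpy(folio + b * BLK, disk + b * BLK, BLK);
	}
}
```

In the non-granular case the partial write is lost; in the granular case it survives until writeback.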
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/dir.c | 2 +-
fs/fuse/fuse_i.h | 8 --------
fs/fuse/inode.c | 13 +------------
3 files changed, 2 insertions(+), 21 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 5c569c3cb53f..ebee7e0b1cd3 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1199,7 +1199,7 @@ static void fuse_fillattr(struct mnt_idmap *idmap, struct inode *inode,
if (attr->blksize != 0)
blkbits = ilog2(attr->blksize);
else
- blkbits = fc->blkbits;
+ blkbits = inode->i_sb->s_blocksize_bits;
stat->blksize = 1 << blkbits;
}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index cc428d04be3e..1647eb7ca6fa 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -975,14 +975,6 @@ struct fuse_conn {
/* Request timeout (in jiffies). 0 = no timeout */
unsigned int req_timeout;
} timeout;
-
- /*
- * This is a workaround until fuse uses iomap for reads.
- * For fuseblk servers, this represents the blocksize passed in at
- * mount time and for regular fuse servers, this is equivalent to
- * inode->i_blkbits.
- */
- u8 blkbits;
};
/*
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 7485a41af892..a1b9e8587155 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -292,7 +292,7 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
if (attr->blksize)
fi->cached_i_blkbits = ilog2(attr->blksize);
else
- fi->cached_i_blkbits = fc->blkbits;
+ fi->cached_i_blkbits = inode->i_sb->s_blocksize_bits;
/*
* Don't set the sticky bit in i_mode, unless we want the VFS
@@ -1810,21 +1810,10 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
err = -EINVAL;
if (!sb_set_blocksize(sb, ctx->blksize))
goto err;
- /*
- * This is a workaround until fuse hooks into iomap for reads.
- * Use PAGE_SIZE for the blocksize else if the writeback cache
- * is enabled, buffered writes go through iomap and a read may
- * overwrite partially written data if blocksize < PAGE_SIZE
- */
- fc->blkbits = sb->s_blocksize_bits;
- if (ctx->blksize != PAGE_SIZE &&
- !sb_set_blocksize(sb, PAGE_SIZE))
- goto err;
#endif
} else {
sb->s_blocksize = PAGE_SIZE;
sb->s_blocksize_bits = PAGE_SHIFT;
- fc->blkbits = sb->s_blocksize_bits;
}
sb->s_subtype = ctx->subtype;
--
2.47.3
* Re: [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead
2025-09-26 0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
` (13 preceding siblings ...)
2025-09-26 0:26 ` [PATCH v5 14/14] fuse: remove fc->blkbits workaround for partial writes Joanne Koong
@ 2025-09-29 9:38 ` Christian Brauner
14 siblings, 0 replies; 50+ messages in thread
From: Christian Brauner @ 2025-09-29 9:38 UTC (permalink / raw)
To: miklos, Joanne Koong
Cc: Christian Brauner, djwong, hch, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc, Gao Xiang
On Thu, 25 Sep 2025 17:25:55 -0700, Joanne Koong wrote:
> This series adds fuse iomap support for buffered reads and readahead.
> This is needed so that granular uptodate tracking can be used in fuse when
> large folios are enabled so that only the non-uptodate portions of the folio
> need to be read in instead of having to read in the entire folio. It also is
> needed in order to turn on large folios for servers that use the writeback
> cache since otherwise there is a race condition that may lead to data
> corruption if there is a partial write, then a read and the read happens
> before the write has undergone writeback, since otherwise the folio will not
> be marked uptodate from the partial write so the read will read in the entire
> folio from disk, which will overwrite the partial write.
>
> [...]
Applied to the vfs-6.19.iomap branch of the vfs/vfs.git tree.
Patches in the vfs-6.19.iomap branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.19.iomap
[01/14] iomap: move bio read logic into helper function
https://git.kernel.org/vfs/vfs/c/4b1f54633425
[02/14] iomap: move read/readahead bio submission logic into helper function
https://git.kernel.org/vfs/vfs/c/22159441469a
[03/14] iomap: store read/readahead bio generically
https://git.kernel.org/vfs/vfs/c/7c732b99c04f
[04/14] iomap: iterate over folio mapping in iomap_readpage_iter()
https://git.kernel.org/vfs/vfs/c/3b404627d3e2
[05/14] iomap: rename iomap_readpage_iter() to iomap_read_folio_iter()
https://git.kernel.org/vfs/vfs/c/bf8b9f4ce6a9
[06/14] iomap: rename iomap_readpage_ctx struct to iomap_read_folio_ctx
https://git.kernel.org/vfs/vfs/c/abea60c60330
[07/14] iomap: track pending read bytes more optimally
https://git.kernel.org/vfs/vfs/c/13cc90f6c38e
[08/14] iomap: set accurate iter->pos when reading folio ranges
https://git.kernel.org/vfs/vfs/c/63adb033604e
[09/14] iomap: add caller-provided callbacks for read and readahead
https://git.kernel.org/vfs/vfs/c/56b6f5d3792b
[10/14] iomap: move buffered io bio logic into new file
https://git.kernel.org/vfs/vfs/c/80cd9857c47f
[11/14] iomap: make iomap_read_folio() a void return
https://git.kernel.org/vfs/vfs/c/434651f1a9b7
[12/14] fuse: use iomap for read_folio
https://git.kernel.org/vfs/vfs/c/12cae30dc565
[13/14] fuse: use iomap for readahead
https://git.kernel.org/vfs/vfs/c/0853f58ed0b4
[14/14] fuse: remove fc->blkbits workaround for partial writes
https://git.kernel.org/vfs/vfs/c/bb944dc82db1
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-09-26 0:26 ` [PATCH v5 07/14] iomap: track pending read bytes more optimally Joanne Koong
@ 2025-10-23 19:34 ` Brian Foster
2025-10-24 0:01 ` Joanne Koong
0 siblings, 1 reply; 50+ messages in thread
From: Brian Foster @ 2025-10-23 19:34 UTC (permalink / raw)
To: Joanne Koong
Cc: brauner, miklos, djwong, hch, hsiangkao, linux-block, gfs2,
linux-fsdevel, kernel-team, linux-xfs, linux-doc
On Thu, Sep 25, 2025 at 05:26:02PM -0700, Joanne Koong wrote:
> Instead of incrementing read_bytes_pending for every folio range read in
> (which requires acquiring the spinlock to do so), set read_bytes_pending
> to the folio size when the first range is asynchronously read in, keep
> track of how many bytes total are asynchronously read in, and adjust
> read_bytes_pending accordingly after issuing requests to read in all the
> necessary ranges.
>
> iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
> value for pending bytes necessarily indicates the folio is in the bio.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
Hi Joanne,
I was throwing some extra testing at the vfs-6.19.iomap branch since the
little merge conflict thing with iomap_iter_advance(). I end up hitting
what appears to be a lockup on XFS with 1k FSB (-bsize=1k) running
generic/051. It reproduces fairly reliably within a few iterations or so
and seems to always stall during a read for a dedupe operation:
task:fsstress state:D stack:0 pid:12094 tgid:12094 ppid:12091 task_flags:0x400140 flags:0x00080003
Call Trace:
<TASK>
__schedule+0x2fc/0x7a0
schedule+0x27/0x80
io_schedule+0x46/0x70
folio_wait_bit_common+0x12b/0x310
? __pfx_wake_page_function+0x10/0x10
? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
filemap_read_folio+0x85/0xd0
? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
do_read_cache_folio+0x7c/0x1b0
vfs_dedupe_file_range_compare.constprop.0+0xaf/0x2d0
__generic_remap_file_range_prep+0x276/0x2a0
generic_remap_file_range_prep+0x10/0x20
xfs_reflink_remap_prep+0x22c/0x300 [xfs]
xfs_file_remap_range+0x84/0x360 [xfs]
vfs_dedupe_file_range_one+0x1b2/0x1d0
? remap_verify_area+0x46/0x140
vfs_dedupe_file_range+0x162/0x220
do_vfs_ioctl+0x4d1/0x940
__x64_sys_ioctl+0x75/0xe0
do_syscall_64+0x84/0x800
? do_syscall_64+0xbb/0x800
? avc_has_perm_noaudit+0x6b/0xf0
? _copy_to_user+0x31/0x40
? cp_new_stat+0x130/0x170
? __do_sys_newfstat+0x44/0x70
? do_syscall_64+0xbb/0x800
? do_syscall_64+0xbb/0x800
? clear_bhb_loop+0x30/0x80
? clear_bhb_loop+0x30/0x80
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fe6bbd9a14d
RSP: 002b:00007ffde72cd4e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000068 RCX: 00007fe6bbd9a14d
RDX: 000000000a1394b0 RSI: 00000000c0189436 RDI: 0000000000000004
RBP: 00007ffde72cd530 R08: 0000000000001000 R09: 000000000a11a3fc
R10: 000000000001d6c0 R11: 0000000000000246 R12: 000000000a12cfb0
R13: 000000000a12ba10 R14: 000000000a14e610 R15: 0000000000019000
</TASK>
It wasn't immediately clear to me what the issue was so I bisected and
it landed on this patch. It kind of looks like we're failing to unlock a
folio at some point and then tripping over it later..? I can kill the
fsstress process but then the umount ultimately gets stuck tossing
pagecache [1], so the mount still ends up stuck indefinitely. Anyways,
I'll poke at it some more but I figure you might be able to make sense
of this faster than I can.
Brian
[1] umount stack trace:
task:umount state:D stack:0 pid:12216 tgid:12216 ppid:2514 task_flags:0x400100 flags:0x00080001
Call Trace:
<TASK>
__schedule+0x2fc/0x7a0
schedule+0x27/0x80
io_schedule+0x46/0x70
folio_wait_bit_common+0x12b/0x310
? __pfx_wake_page_function+0x10/0x10
truncate_inode_pages_range+0x42a/0x4d0
xfs_fs_evict_inode+0x1f/0x30 [xfs]
evict+0x112/0x290
evict_inodes+0x209/0x230
generic_shutdown_super+0x42/0x100
kill_block_super+0x1a/0x40
xfs_kill_sb+0x12/0x20 [xfs]
deactivate_locked_super+0x33/0xb0
cleanup_mnt+0xba/0x150
task_work_run+0x5c/0x90
exit_to_user_mode_loop+0x12f/0x170
do_syscall_64+0x1af/0x800
? vfs_statx+0x80/0x160
? do_statx+0x62/0xa0
? __x64_sys_statx+0xaf/0x100
? do_syscall_64+0xbb/0x800
? __x64_sys_statx+0xaf/0x100
? do_syscall_64+0xbb/0x800
? count_memcg_events+0xdd/0x1b0
? handle_mm_fault+0x220/0x340
? do_user_addr_fault+0x2c3/0x7f0
? clear_bhb_loop+0x30/0x80
? clear_bhb_loop+0x30/0x80
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fdd641ed5ab
RSP: 002b:00007ffd671182e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 0000559b3e2056b0 RCX: 00007fdd641ed5ab
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000559b3e205ac0
RBP: 00007ffd671183c0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000103 R11: 0000000000000246 R12: 0000559b3e2057b8
R13: 0000000000000000 R14: 0000559b3e205ac0 R15: 0000000000000000
</TASK>
> fs/iomap/buffered-io.c | 87 ++++++++++++++++++++++++++++++++----------
> 1 file changed, 66 insertions(+), 21 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 09e65771a947..4e6258fdb915 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -362,7 +362,6 @@ static void iomap_read_end_io(struct bio *bio)
>
> struct iomap_read_folio_ctx {
> struct folio *cur_folio;
> - bool cur_folio_in_bio;
> void *read_ctx;
> struct readahead_control *rac;
> };
> @@ -380,19 +379,11 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
> {
> struct folio *folio = ctx->cur_folio;
> const struct iomap *iomap = &iter->iomap;
> - struct iomap_folio_state *ifs = folio->private;
> size_t poff = offset_in_folio(folio, pos);
> loff_t length = iomap_length(iter);
> sector_t sector;
> struct bio *bio = ctx->read_ctx;
>
> - ctx->cur_folio_in_bio = true;
> - if (ifs) {
> - spin_lock_irq(&ifs->state_lock);
> - ifs->read_bytes_pending += plen;
> - spin_unlock_irq(&ifs->state_lock);
> - }
> -
> sector = iomap_sector(iomap, pos);
> if (!bio || bio_end_sector(bio) != sector ||
> !bio_add_folio(bio, folio, plen, poff)) {
> @@ -422,8 +413,57 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
> }
> }
>
> +static void iomap_read_init(struct folio *folio)
> +{
> + struct iomap_folio_state *ifs = folio->private;
> +
> + if (ifs) {
> + size_t len = folio_size(folio);
> +
> + spin_lock_irq(&ifs->state_lock);
> + ifs->read_bytes_pending += len;
> + spin_unlock_irq(&ifs->state_lock);
> + }
> +}
> +
> +static void iomap_read_end(struct folio *folio, size_t bytes_pending)
> +{
> + struct iomap_folio_state *ifs;
> +
> + /*
> + * If there are no bytes pending, this means we are responsible for
> + * unlocking the folio here, since no IO helper has taken ownership of
> + * it.
> + */
> + if (!bytes_pending) {
> + folio_unlock(folio);
> + return;
> + }
> +
> + ifs = folio->private;
> + if (ifs) {
> + bool end_read, uptodate;
> + size_t bytes_accounted = folio_size(folio) - bytes_pending;
> +
> + spin_lock_irq(&ifs->state_lock);
> + ifs->read_bytes_pending -= bytes_accounted;
> + /*
> + * If !ifs->read_bytes_pending, this means all pending reads
> + * by the IO helper have already completed, which means we need
> + * to end the folio read here. If ifs->read_bytes_pending != 0,
> + * the IO helper will end the folio read.
> + */
> + end_read = !ifs->read_bytes_pending;
> + if (end_read)
> + uptodate = ifs_is_fully_uptodate(folio, ifs);
> + spin_unlock_irq(&ifs->state_lock);
> + if (end_read)
> + folio_end_read(folio, uptodate);
> + }
> +}
> +
> static int iomap_read_folio_iter(struct iomap_iter *iter,
> - struct iomap_read_folio_ctx *ctx)
> + struct iomap_read_folio_ctx *ctx, size_t *bytes_pending)
> {
> const struct iomap *iomap = &iter->iomap;
> loff_t pos = iter->pos;
> @@ -460,6 +500,9 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
> folio_zero_range(folio, poff, plen);
> iomap_set_range_uptodate(folio, poff, plen);
> } else {
> + if (!*bytes_pending)
> + iomap_read_init(folio);
> + *bytes_pending += plen;
> iomap_bio_read_folio_range(iter, ctx, pos, plen);
> }
>
> @@ -482,17 +525,18 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
> struct iomap_read_folio_ctx ctx = {
> .cur_folio = folio,
> };
> + size_t bytes_pending = 0;
> int ret;
>
> trace_iomap_readpage(iter.inode, 1);
>
> while ((ret = iomap_iter(&iter, ops)) > 0)
> - iter.status = iomap_read_folio_iter(&iter, &ctx);
> + iter.status = iomap_read_folio_iter(&iter, &ctx,
> + &bytes_pending);
>
> iomap_bio_submit_read(&ctx);
>
> - if (!ctx.cur_folio_in_bio)
> - folio_unlock(folio);
> + iomap_read_end(folio, bytes_pending);
>
> /*
> * Just like mpage_readahead and block_read_full_folio, we always
> @@ -504,24 +548,23 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
> EXPORT_SYMBOL_GPL(iomap_read_folio);
>
> static int iomap_readahead_iter(struct iomap_iter *iter,
> - struct iomap_read_folio_ctx *ctx)
> + struct iomap_read_folio_ctx *ctx, size_t *cur_bytes_pending)
> {
> int ret;
>
> while (iomap_length(iter)) {
> if (ctx->cur_folio &&
> offset_in_folio(ctx->cur_folio, iter->pos) == 0) {
> - if (!ctx->cur_folio_in_bio)
> - folio_unlock(ctx->cur_folio);
> + iomap_read_end(ctx->cur_folio, *cur_bytes_pending);
> ctx->cur_folio = NULL;
> }
> if (!ctx->cur_folio) {
> ctx->cur_folio = readahead_folio(ctx->rac);
> if (WARN_ON_ONCE(!ctx->cur_folio))
> return -EINVAL;
> - ctx->cur_folio_in_bio = false;
> + *cur_bytes_pending = 0;
> }
> - ret = iomap_read_folio_iter(iter, ctx);
> + ret = iomap_read_folio_iter(iter, ctx, cur_bytes_pending);
> if (ret)
> return ret;
> }
> @@ -554,16 +597,18 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
> struct iomap_read_folio_ctx ctx = {
> .rac = rac,
> };
> + size_t cur_bytes_pending;
>
> trace_iomap_readahead(rac->mapping->host, readahead_count(rac));
>
> while (iomap_iter(&iter, ops) > 0)
> - iter.status = iomap_readahead_iter(&iter, &ctx);
> + iter.status = iomap_readahead_iter(&iter, &ctx,
> + &cur_bytes_pending);
>
> iomap_bio_submit_read(&ctx);
>
> - if (ctx.cur_folio && !ctx.cur_folio_in_bio)
> - folio_unlock(ctx.cur_folio);
> + if (ctx.cur_folio)
> + iomap_read_end(ctx.cur_folio, cur_bytes_pending);
> }
> EXPORT_SYMBOL_GPL(iomap_readahead);
>
> --
> 2.47.3
>
>
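The read_bytes_pending batching the bisect landed on can be sketched in userspace (illustrative only: lock_state() and read_bytes_pending below are stand-ins, not kernel symbols). Instead of a locked increment per submitted range, the new scheme charges the whole folio once when the first range is submitted and reconciles the unread remainder in one locked update at the end:

```c
#define FOLIO_SIZE 4096u

static unsigned int lock_acquisitions;
static unsigned int read_bytes_pending;

static void lock_state(void)   { lock_acquisitions++; }
static void unlock_state(void) { }

/* Old scheme: one locked update per range submitted for read. */
static void old_submit_range(unsigned int plen)
{
	lock_state();
	read_bytes_pending += plen;
	unlock_state();
}

/* New scheme: charge the whole folio on the first submitted range and
 * count submitted bytes locally, without the lock... */
static void new_submit_range(unsigned int *bytes_pending, unsigned int plen)
{
	if (*bytes_pending == 0) {
		lock_state();
		read_bytes_pending += FOLIO_SIZE;
		unlock_state();
	}
	*bytes_pending += plen;
}

/* ...then give back the bytes that were never submitted in a single
 * locked update once all ranges have been issued. */
static void new_read_end(unsigned int bytes_pending)
{
	lock_state();
	read_bytes_pending -= FOLIO_SIZE - bytes_pending;
	unlock_state();
}
```

Both schemes leave the same pending count for the IO completions to drain; the new one just takes the lock a bounded number of times per folio instead of once per range.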
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-10-23 19:34 ` Brian Foster
@ 2025-10-24 0:01 ` Joanne Koong
2025-10-24 16:25 ` Joanne Koong
0 siblings, 1 reply; 50+ messages in thread
From: Joanne Koong @ 2025-10-24 0:01 UTC (permalink / raw)
To: Brian Foster
Cc: brauner, miklos, djwong, hch, hsiangkao, linux-block, gfs2,
linux-fsdevel, kernel-team, linux-xfs, linux-doc
On Thu, Oct 23, 2025 at 12:30 PM Brian Foster <bfoster@redhat.com> wrote:
>
> On Thu, Sep 25, 2025 at 05:26:02PM -0700, Joanne Koong wrote:
> > Instead of incrementing read_bytes_pending for every folio range read in
> > (which requires acquiring the spinlock to do so), set read_bytes_pending
> > to the folio size when the first range is asynchronously read in, keep
> > track of how many bytes total are asynchronously read in, and adjust
> > read_bytes_pending accordingly after issuing requests to read in all the
> > necessary ranges.
> >
> > iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
> > value for pending bytes necessarily indicates the folio is in the bio.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
>
> Hi Joanne,
>
> I was throwing some extra testing at the vfs-6.19.iomap branch since the
> little merge conflict thing with iomap_iter_advance(). I end up hitting
> what appears to be a lockup on XFS with 1k FSB (-bsize=1k) running
> generic/051. It reproduces fairly reliably within a few iterations or so
> and seems to always stall during a read for a dedupe operation:
>
> task:fsstress state:D stack:0 pid:12094 tgid:12094 ppid:12091 task_flags:0x400140 flags:0x00080003
> Call Trace:
> <TASK>
> __schedule+0x2fc/0x7a0
> schedule+0x27/0x80
> io_schedule+0x46/0x70
> folio_wait_bit_common+0x12b/0x310
> ? __pfx_wake_page_function+0x10/0x10
> ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
> filemap_read_folio+0x85/0xd0
> ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
> do_read_cache_folio+0x7c/0x1b0
> vfs_dedupe_file_range_compare.constprop.0+0xaf/0x2d0
> __generic_remap_file_range_prep+0x276/0x2a0
> generic_remap_file_range_prep+0x10/0x20
> xfs_reflink_remap_prep+0x22c/0x300 [xfs]
> xfs_file_remap_range+0x84/0x360 [xfs]
> vfs_dedupe_file_range_one+0x1b2/0x1d0
> ? remap_verify_area+0x46/0x140
> vfs_dedupe_file_range+0x162/0x220
> do_vfs_ioctl+0x4d1/0x940
> __x64_sys_ioctl+0x75/0xe0
> do_syscall_64+0x84/0x800
> ? do_syscall_64+0xbb/0x800
> ? avc_has_perm_noaudit+0x6b/0xf0
> ? _copy_to_user+0x31/0x40
> ? cp_new_stat+0x130/0x170
> ? __do_sys_newfstat+0x44/0x70
> ? do_syscall_64+0xbb/0x800
> ? do_syscall_64+0xbb/0x800
> ? clear_bhb_loop+0x30/0x80
> ? clear_bhb_loop+0x30/0x80
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x7fe6bbd9a14d
> RSP: 002b:00007ffde72cd4e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> RAX: ffffffffffffffda RBX: 0000000000000068 RCX: 00007fe6bbd9a14d
> RDX: 000000000a1394b0 RSI: 00000000c0189436 RDI: 0000000000000004
> RBP: 00007ffde72cd530 R08: 0000000000001000 R09: 000000000a11a3fc
> R10: 000000000001d6c0 R11: 0000000000000246 R12: 000000000a12cfb0
> R13: 000000000a12ba10 R14: 000000000a14e610 R15: 0000000000019000
> </TASK>
>
> It wasn't immediately clear to me what the issue was so I bisected and
> it landed on this patch. It kind of looks like we're failing to unlock a
> folio at some point and then tripping over it later..? I can kill the
> fsstress process but then the umount ultimately gets stuck tossing
> pagecache [1], so the mount still ends up stuck indefinitely. Anyways,
> I'll poke at it some more but I figure you might be able to make sense
> of this faster than I can.
>
> Brian
Hi Brian,
Thanks for your report and the repro instructions. I will look into
this and report back what I find.
Thanks,
Joanne
>
> [1] umount stack trace:
>
> task:umount state:D stack:0 pid:12216 tgid:12216 ppid:2514 task_flags:0x400100 flags:0x00080001
> Call Trace:
> <TASK>
> __schedule+0x2fc/0x7a0
> schedule+0x27/0x80
> io_schedule+0x46/0x70
> folio_wait_bit_common+0x12b/0x310
> ? __pfx_wake_page_function+0x10/0x10
> truncate_inode_pages_range+0x42a/0x4d0
> xfs_fs_evict_inode+0x1f/0x30 [xfs]
> evict+0x112/0x290
> evict_inodes+0x209/0x230
> generic_shutdown_super+0x42/0x100
> kill_block_super+0x1a/0x40
> xfs_kill_sb+0x12/0x20 [xfs]
> deactivate_locked_super+0x33/0xb0
> cleanup_mnt+0xba/0x150
> task_work_run+0x5c/0x90
> exit_to_user_mode_loop+0x12f/0x170
> do_syscall_64+0x1af/0x800
> ? vfs_statx+0x80/0x160
> ? do_statx+0x62/0xa0
> ? __x64_sys_statx+0xaf/0x100
> ? do_syscall_64+0xbb/0x800
> ? __x64_sys_statx+0xaf/0x100
> ? do_syscall_64+0xbb/0x800
> ? count_memcg_events+0xdd/0x1b0
> ? handle_mm_fault+0x220/0x340
> ? do_user_addr_fault+0x2c3/0x7f0
> ? clear_bhb_loop+0x30/0x80
> ? clear_bhb_loop+0x30/0x80
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x7fdd641ed5ab
> RSP: 002b:00007ffd671182e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
> RAX: 0000000000000000 RBX: 0000559b3e2056b0 RCX: 00007fdd641ed5ab
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000559b3e205ac0
> RBP: 00007ffd671183c0 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000103 R11: 0000000000000246 R12: 0000559b3e2057b8
> R13: 0000000000000000 R14: 0000559b3e205ac0 R15: 0000000000000000
> </TASK>
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-10-24 0:01 ` Joanne Koong
@ 2025-10-24 16:25 ` Joanne Koong
2025-10-24 17:14 ` Brian Foster
2025-10-24 17:21 ` Matthew Wilcox
0 siblings, 2 replies; 50+ messages in thread
From: Joanne Koong @ 2025-10-24 16:25 UTC (permalink / raw)
To: Brian Foster
Cc: brauner, miklos, djwong, hch, hsiangkao, linux-block, gfs2,
linux-fsdevel, kernel-team, linux-xfs, linux-doc
On Thu, Oct 23, 2025 at 5:01 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Thu, Oct 23, 2025 at 12:30 PM Brian Foster <bfoster@redhat.com> wrote:
> >
> > On Thu, Sep 25, 2025 at 05:26:02PM -0700, Joanne Koong wrote:
> > > Instead of incrementing read_bytes_pending for every folio range read in
> > > (which requires acquiring the spinlock to do so), set read_bytes_pending
> > > to the folio size when the first range is asynchronously read in, keep
> > > track of how many bytes total are asynchronously read in, and adjust
> > > read_bytes_pending accordingly after issuing requests to read in all the
> > > necessary ranges.
> > >
> > > iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
> > > value for pending bytes necessarily indicates the folio is in the bio.
> > >
> > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
> > > ---
> >
> > Hi Joanne,
> >
> > I was throwing some extra testing at the vfs-6.19.iomap branch since the
> > little merge conflict thing with iomap_iter_advance(). I end up hitting
> > what appears to be a lockup on XFS with 1k FSB (-bsize=1k) running
> > generic/051. It reproduces fairly reliably within a few iterations or so
> > and seems to always stall during a read for a dedupe operation:
> >
> > task:fsstress state:D stack:0 pid:12094 tgid:12094 ppid:12091 task_flags:0x400140 flags:0x00080003
> > Call Trace:
> > <TASK>
> > __schedule+0x2fc/0x7a0
> > schedule+0x27/0x80
> > io_schedule+0x46/0x70
> > folio_wait_bit_common+0x12b/0x310
> > ? __pfx_wake_page_function+0x10/0x10
> > ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
> > filemap_read_folio+0x85/0xd0
> > ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
> > do_read_cache_folio+0x7c/0x1b0
> > vfs_dedupe_file_range_compare.constprop.0+0xaf/0x2d0
> > __generic_remap_file_range_prep+0x276/0x2a0
> > generic_remap_file_range_prep+0x10/0x20
> > xfs_reflink_remap_prep+0x22c/0x300 [xfs]
> > xfs_file_remap_range+0x84/0x360 [xfs]
> > vfs_dedupe_file_range_one+0x1b2/0x1d0
> > ? remap_verify_area+0x46/0x140
> > vfs_dedupe_file_range+0x162/0x220
> > do_vfs_ioctl+0x4d1/0x940
> > __x64_sys_ioctl+0x75/0xe0
> > do_syscall_64+0x84/0x800
> > ? do_syscall_64+0xbb/0x800
> > ? avc_has_perm_noaudit+0x6b/0xf0
> > ? _copy_to_user+0x31/0x40
> > ? cp_new_stat+0x130/0x170
> > ? __do_sys_newfstat+0x44/0x70
> > ? do_syscall_64+0xbb/0x800
> > ? do_syscall_64+0xbb/0x800
> > ? clear_bhb_loop+0x30/0x80
> > ? clear_bhb_loop+0x30/0x80
> > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > RIP: 0033:0x7fe6bbd9a14d
> > RSP: 002b:00007ffde72cd4e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > RAX: ffffffffffffffda RBX: 0000000000000068 RCX: 00007fe6bbd9a14d
> > RDX: 000000000a1394b0 RSI: 00000000c0189436 RDI: 0000000000000004
> > RBP: 00007ffde72cd530 R08: 0000000000001000 R09: 000000000a11a3fc
> > R10: 000000000001d6c0 R11: 0000000000000246 R12: 000000000a12cfb0
> > R13: 000000000a12ba10 R14: 000000000a14e610 R15: 0000000000019000
> > </TASK>
> >
> > It wasn't immediately clear to me what the issue was so I bisected and
> > it landed on this patch. It kind of looks like we're failing to unlock a
> > folio at some point and then tripping over it later..? I can kill the
> > fsstress process but then the umount ultimately gets stuck tossing
> > pagecache [1], so the mount still ends up stuck indefinitely. Anyways,
> > I'll poke at it some more but I figure you might be able to make sense
> > of this faster than I can.
> >
> > Brian
>
> Hi Brian,
>
> Thanks for your report and the repro instructions. I will look into
> this and report back what I find.
This is the fix:
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4e6258fdb915..aa46fec8362d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -445,6 +445,9 @@ static void iomap_read_end(struct folio *folio, size_t bytes_pending)
 	bool end_read, uptodate;
 	size_t bytes_accounted = folio_size(folio) - bytes_pending;
 
+	if (!bytes_accounted)
+		return;
+
 	spin_lock_irq(&ifs->state_lock);
What I missed was that if all the bytes in the folio are non-uptodate
and need to be read in by the filesystem, then there's a bug where the
read will be ended on the folio twice (in iomap_read_end() and again
when the filesystem calls iomap_finish_folio_read(), when only the
filesystem should end the read), which does two folio unlocks and ends
up locking the folio. Looking at the writeback patch that does a
similar optimization [1], I missed the same thing there.
I'll fix up both. Thanks for catching this and bisecting it down to
this patch. Sorry for the trouble.
Thanks,
Joanne
[1] https://lore.kernel.org/linux-fsdevel/20251009225611.3744728-4-joannelkoong@gmail.com/
>
> Thanks,
> Joanne
> >
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-10-24 16:25 ` Joanne Koong
@ 2025-10-24 17:14 ` Brian Foster
2025-10-24 19:48 ` Joanne Koong
2025-10-24 17:21 ` Matthew Wilcox
1 sibling, 1 reply; 50+ messages in thread
From: Brian Foster @ 2025-10-24 17:14 UTC (permalink / raw)
To: Joanne Koong
Cc: brauner, miklos, djwong, hch, hsiangkao, linux-block, gfs2,
linux-fsdevel, kernel-team, linux-xfs, linux-doc
On Fri, Oct 24, 2025 at 09:25:13AM -0700, Joanne Koong wrote:
> On Thu, Oct 23, 2025 at 5:01 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Thu, Oct 23, 2025 at 12:30 PM Brian Foster <bfoster@redhat.com> wrote:
> > >
> > > On Thu, Sep 25, 2025 at 05:26:02PM -0700, Joanne Koong wrote:
> > > > Instead of incrementing read_bytes_pending for every folio range read in
> > > > (which requires acquiring the spinlock to do so), set read_bytes_pending
> > > > to the folio size when the first range is asynchronously read in, keep
> > > > track of how many bytes total are asynchronously read in, and adjust
> > > > read_bytes_pending accordingly after issuing requests to read in all the
> > > > necessary ranges.
> > > >
> > > > iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
> > > > value for pending bytes necessarily indicates the folio is in the bio.
> > > >
> > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
> > > > ---
> > >
> > > Hi Joanne,
> > >
> > > I was throwing some extra testing at the vfs-6.19.iomap branch since the
> > > little merge conflict thing with iomap_iter_advance(). I end up hitting
> > > what appears to be a lockup on XFS with 1k FSB (-bsize=1k) running
> > > generic/051. It reproduces fairly reliably within a few iterations or so
> > > and seems to always stall during a read for a dedupe operation:
> > >
> > > task:fsstress state:D stack:0 pid:12094 tgid:12094 ppid:12091 task_flags:0x400140 flags:0x00080003
> > > Call Trace:
> > > <TASK>
> > > __schedule+0x2fc/0x7a0
> > > schedule+0x27/0x80
> > > io_schedule+0x46/0x70
> > > folio_wait_bit_common+0x12b/0x310
> > > ? __pfx_wake_page_function+0x10/0x10
> > > ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
> > > filemap_read_folio+0x85/0xd0
> > > ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
> > > do_read_cache_folio+0x7c/0x1b0
> > > vfs_dedupe_file_range_compare.constprop.0+0xaf/0x2d0
> > > __generic_remap_file_range_prep+0x276/0x2a0
> > > generic_remap_file_range_prep+0x10/0x20
> > > xfs_reflink_remap_prep+0x22c/0x300 [xfs]
> > > xfs_file_remap_range+0x84/0x360 [xfs]
> > > vfs_dedupe_file_range_one+0x1b2/0x1d0
> > > ? remap_verify_area+0x46/0x140
> > > vfs_dedupe_file_range+0x162/0x220
> > > do_vfs_ioctl+0x4d1/0x940
> > > __x64_sys_ioctl+0x75/0xe0
> > > do_syscall_64+0x84/0x800
> > > ? do_syscall_64+0xbb/0x800
> > > ? avc_has_perm_noaudit+0x6b/0xf0
> > > ? _copy_to_user+0x31/0x40
> > > ? cp_new_stat+0x130/0x170
> > > ? __do_sys_newfstat+0x44/0x70
> > > ? do_syscall_64+0xbb/0x800
> > > ? do_syscall_64+0xbb/0x800
> > > ? clear_bhb_loop+0x30/0x80
> > > ? clear_bhb_loop+0x30/0x80
> > > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > RIP: 0033:0x7fe6bbd9a14d
> > > RSP: 002b:00007ffde72cd4e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > > RAX: ffffffffffffffda RBX: 0000000000000068 RCX: 00007fe6bbd9a14d
> > > RDX: 000000000a1394b0 RSI: 00000000c0189436 RDI: 0000000000000004
> > > RBP: 00007ffde72cd530 R08: 0000000000001000 R09: 000000000a11a3fc
> > > R10: 000000000001d6c0 R11: 0000000000000246 R12: 000000000a12cfb0
> > > R13: 000000000a12ba10 R14: 000000000a14e610 R15: 0000000000019000
> > > </TASK>
> > >
> > > It wasn't immediately clear to me what the issue was so I bisected and
> > > it landed on this patch. It kind of looks like we're failing to unlock a
> > > folio at some point and then tripping over it later..? I can kill the
> > > fsstress process but then the umount ultimately gets stuck tossing
> > > pagecache [1], so the mount still ends up stuck indefinitely. Anyways,
> > > I'll poke at it some more but I figure you might be able to make sense
> > > of this faster than I can.
> > >
> > > Brian
> >
> > Hi Brian,
> >
> > Thanks for your report and the repro instructions. I will look into
> > this and report back what I find.
>
> This is the fix:
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 4e6258fdb915..aa46fec8362d 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -445,6 +445,9 @@ static void iomap_read_end(struct folio *folio, size_t bytes_pending)
>  	bool end_read, uptodate;
>  	size_t bytes_accounted = folio_size(folio) - bytes_pending;
>  
> +	if (!bytes_accounted)
> +		return;
> +
>  	spin_lock_irq(&ifs->state_lock);
>
>
> What I missed was that if all the bytes in the folio are non-uptodate
> and need to be read in by the filesystem, then there's a bug where the
> read will be ended on the folio twice (in iomap_read_end() and again
> when the filesystem calls iomap_finish_folio_read(), when only the
> filesystem should end the read), which does two folio unlocks and ends
> up locking the folio. Looking at the writeback patch that does a
> similar optimization [1], I missed the same thing there.
>
Makes sense.. though a short comment wouldn't hurt in there. ;) I found
myself a little confused by the accounted vs. pending naming when
reading through that code. If I follow correctly, the intent is to refer
to the additional bytes accounted to read_bytes_pending via the init
(where it just accounts the whole folio up front) and pending refers to
submitted I/O.
Presumably that extra accounting doubly serves as the typical "don't
complete the op before the submitter is done processing" extra
reference, except in this full submit case of course. If so, that's
subtle enough in my mind that a sentence or two on it wouldn't hurt..
> I'll fix up both. Thanks for catching this and bisecting it down to
> this patch. Sorry for the trouble.
>
No prob. Thanks for the fix!
Brian
> Thanks,
> Joanne
>
> [1] https://lore.kernel.org/linux-fsdevel/20251009225611.3744728-4-joannelkoong@gmail.com/
> >
> > Thanks,
> > Joanne
> > >
>
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-10-24 16:25 ` Joanne Koong
2025-10-24 17:14 ` Brian Foster
@ 2025-10-24 17:21 ` Matthew Wilcox
2025-10-24 19:22 ` Joanne Koong
1 sibling, 1 reply; 50+ messages in thread
From: Matthew Wilcox @ 2025-10-24 17:21 UTC (permalink / raw)
To: Joanne Koong
Cc: Brian Foster, brauner, miklos, djwong, hch, hsiangkao,
linux-block, gfs2, linux-fsdevel, kernel-team, linux-xfs,
linux-doc
On Fri, Oct 24, 2025 at 09:25:13AM -0700, Joanne Koong wrote:
> What I missed was that if all the bytes in the folio are non-uptodate
> and need to be read in by the filesystem, then there's a bug where the
> read will be ended on the folio twice (in iomap_read_end() and again
> when the filesystem calls iomap_finish_folio_read(), when only the
> filesystem should end the read), which does two folio unlocks and ends
> up locking the folio. Looking at the writeback patch that does a
> similar optimization [1], I missed the same thing there.
folio_unlock() contains:
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
Feels like more filesystem people should be enabling CONFIG_DEBUG_VM
when testing (excluding performance testing of course; it'll do ugly
things to your performance numbers).
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-10-24 17:21 ` Matthew Wilcox
@ 2025-10-24 19:22 ` Joanne Koong
2025-10-24 20:59 ` Matthew Wilcox
0 siblings, 1 reply; 50+ messages in thread
From: Joanne Koong @ 2025-10-24 19:22 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Brian Foster, brauner, miklos, djwong, hch, hsiangkao,
linux-block, gfs2, linux-fsdevel, kernel-team, linux-xfs,
linux-doc
On Fri, Oct 24, 2025 at 10:21 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Oct 24, 2025 at 09:25:13AM -0700, Joanne Koong wrote:
> > What I missed was that if all the bytes in the folio are non-uptodate
> > and need to be read in by the filesystem, then there's a bug where the
> > read will be ended on the folio twice (in iomap_read_end() and again
> > when the filesystem calls iomap_finish_folio_read(), when only the
> > filesystem should end the read), which does two folio unlocks and ends
> > up locking the folio. Looking at the writeback patch that does a
> > similar optimization [1], I missed the same thing there.
>
> folio_unlock() contains:
> VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>
> Feels like more filesystem people should be enabling CONFIG_DEBUG_VM
> when testing (excluding performance testing of course; it'll do ugly
> things to your performance numbers).
Point taken. It looks like there's a bunch of other memory debugging
configs as well. Do you recommend enabling all of these when testing?
Do you have a particular .config you use for when you run tests?
Thanks,
Joanne
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-10-24 17:14 ` Brian Foster
@ 2025-10-24 19:48 ` Joanne Koong
2025-10-24 21:55 ` Joanne Koong
0 siblings, 1 reply; 50+ messages in thread
From: Joanne Koong @ 2025-10-24 19:48 UTC (permalink / raw)
To: Brian Foster
Cc: brauner, miklos, djwong, hch, hsiangkao, linux-block, gfs2,
linux-fsdevel, kernel-team, linux-xfs, linux-doc
On Fri, Oct 24, 2025 at 10:10 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Fri, Oct 24, 2025 at 09:25:13AM -0700, Joanne Koong wrote:
> > On Thu, Oct 23, 2025 at 5:01 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Thu, Oct 23, 2025 at 12:30 PM Brian Foster <bfoster@redhat.com> wrote:
> > > >
> > > > On Thu, Sep 25, 2025 at 05:26:02PM -0700, Joanne Koong wrote:
> > > > > Instead of incrementing read_bytes_pending for every folio range read in
> > > > > (which requires acquiring the spinlock to do so), set read_bytes_pending
> > > > > to the folio size when the first range is asynchronously read in, keep
> > > > > track of how many bytes total are asynchronously read in, and adjust
> > > > > read_bytes_pending accordingly after issuing requests to read in all the
> > > > > necessary ranges.
> > > > >
> > > > > iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
> > > > > value for pending bytes necessarily indicates the folio is in the bio.
> > > > >
> > > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > > Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
> > > > > ---
> > > >
> > > > Hi Joanne,
> > > >
> > > > I was throwing some extra testing at the vfs-6.19.iomap branch since the
> > > > little merge conflict thing with iomap_iter_advance(). I end up hitting
> > > > what appears to be a lockup on XFS with 1k FSB (-bsize=1k) running
> > > > generic/051. It reproduces fairly reliably within a few iterations or so
> > > > and seems to always stall during a read for a dedupe operation:
> > > >
> > > > task:fsstress state:D stack:0 pid:12094 tgid:12094 ppid:12091 task_flags:0x400140 flags:0x00080003
> > > > Call Trace:
> > > > <TASK>
> > > > __schedule+0x2fc/0x7a0
> > > > schedule+0x27/0x80
> > > > io_schedule+0x46/0x70
> > > > folio_wait_bit_common+0x12b/0x310
> > > > ? __pfx_wake_page_function+0x10/0x10
> > > > ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
> > > > filemap_read_folio+0x85/0xd0
> > > > ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
> > > > do_read_cache_folio+0x7c/0x1b0
> > > > vfs_dedupe_file_range_compare.constprop.0+0xaf/0x2d0
> > > > __generic_remap_file_range_prep+0x276/0x2a0
> > > > generic_remap_file_range_prep+0x10/0x20
> > > > xfs_reflink_remap_prep+0x22c/0x300 [xfs]
> > > > xfs_file_remap_range+0x84/0x360 [xfs]
> > > > vfs_dedupe_file_range_one+0x1b2/0x1d0
> > > > ? remap_verify_area+0x46/0x140
> > > > vfs_dedupe_file_range+0x162/0x220
> > > > do_vfs_ioctl+0x4d1/0x940
> > > > __x64_sys_ioctl+0x75/0xe0
> > > > do_syscall_64+0x84/0x800
> > > > ? do_syscall_64+0xbb/0x800
> > > > ? avc_has_perm_noaudit+0x6b/0xf0
> > > > ? _copy_to_user+0x31/0x40
> > > > ? cp_new_stat+0x130/0x170
> > > > ? __do_sys_newfstat+0x44/0x70
> > > > ? do_syscall_64+0xbb/0x800
> > > > ? do_syscall_64+0xbb/0x800
> > > > ? clear_bhb_loop+0x30/0x80
> > > > ? clear_bhb_loop+0x30/0x80
> > > > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > > RIP: 0033:0x7fe6bbd9a14d
> > > > RSP: 002b:00007ffde72cd4e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > > > RAX: ffffffffffffffda RBX: 0000000000000068 RCX: 00007fe6bbd9a14d
> > > > RDX: 000000000a1394b0 RSI: 00000000c0189436 RDI: 0000000000000004
> > > > RBP: 00007ffde72cd530 R08: 0000000000001000 R09: 000000000a11a3fc
> > > > R10: 000000000001d6c0 R11: 0000000000000246 R12: 000000000a12cfb0
> > > > R13: 000000000a12ba10 R14: 000000000a14e610 R15: 0000000000019000
> > > > </TASK>
> > > >
> > > > It wasn't immediately clear to me what the issue was so I bisected and
> > > > it landed on this patch. It kind of looks like we're failing to unlock a
> > > > folio at some point and then tripping over it later..? I can kill the
> > > > fsstress process but then the umount ultimately gets stuck tossing
> > > > pagecache [1], so the mount still ends up stuck indefinitely. Anyways,
> > > > I'll poke at it some more but I figure you might be able to make sense
> > > > of this faster than I can.
> > > >
> > > > Brian
> > >
> > > Hi Brian,
> > >
> > > Thanks for your report and the repro instructions. I will look into
> > > this and report back what I find.
> >
> > This is the fix:
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 4e6258fdb915..aa46fec8362d 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -445,6 +445,9 @@ static void iomap_read_end(struct folio *folio, size_t bytes_pending)
> >  	bool end_read, uptodate;
> >  	size_t bytes_accounted = folio_size(folio) - bytes_pending;
> >  
> > +	if (!bytes_accounted)
> > +		return;
> > +
> >  	spin_lock_irq(&ifs->state_lock);
> >
> >
> > > What I missed was that if all the bytes in the folio are non-uptodate
> > > and need to be read in by the filesystem, then there's a bug where the
> > > read will be ended on the folio twice (in iomap_read_end() and again
> > > when the filesystem calls iomap_finish_folio_read(), when only the
> > > filesystem should end the read), which does two folio unlocks and ends
> > > up locking the folio. Looking at the writeback patch that does a
> > > similar optimization [1], I missed the same thing there.
> >
>
> Makes sense.. though a short comment wouldn't hurt in there. ;) I found
> myself a little confused by the accounted vs. pending naming when
> reading through that code. If I follow correctly, the intent is to refer
> to the additional bytes accounted to read_bytes_pending via the init
> (where it just accounts the whole folio up front) and pending refers to
> submitted I/O.
>
> Presumably that extra accounting doubly serves as the typical "don't
> complete the op before the submitter is done processing" extra
> reference, except in this full submit case of course. If so, that's
> subtle enough in my mind that a sentence or two on it wouldn't hurt..
I will add a comment about this :) That's a good point about the
naming; maybe "bytes_submitted" and "bytes_unsubmitted" would be a lot
less confusing than "bytes_pending" and "bytes_accounted".
Thanks,
Joanne
>
> > I'll fix up both. Thanks for catching this and bisecting it down to
> > this patch. Sorry for the trouble.
> >
>
> No prob. Thanks for the fix!
>
> Brian
>
> > Thanks,
> > Joanne
> >
> > [1] https://lore.kernel.org/linux-fsdevel/20251009225611.3744728-4-joannelkoong@gmail.com/
> > >
> > > Thanks,
> > > Joanne
> > > >
> >
>
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-10-24 19:22 ` Joanne Koong
@ 2025-10-24 20:59 ` Matthew Wilcox
2025-10-24 21:37 ` Darrick J. Wong
2025-10-24 21:58 ` Joanne Koong
0 siblings, 2 replies; 50+ messages in thread
From: Matthew Wilcox @ 2025-10-24 20:59 UTC (permalink / raw)
To: Joanne Koong
Cc: Brian Foster, brauner, miklos, djwong, hch, hsiangkao,
linux-block, gfs2, linux-fsdevel, kernel-team, linux-xfs,
linux-doc
On Fri, Oct 24, 2025 at 12:22:32PM -0700, Joanne Koong wrote:
> > Feels like more filesystem people should be enabling CONFIG_DEBUG_VM
> > when testing (excluding performance testing of course; it'll do ugly
> > things to your performance numbers).
>
> Point taken. It looks like there's a bunch of other memory debugging
> configs as well. Do you recommend enabling all of these when testing?
> Do you have a particular .config you use for when you run tests?
Our Kconfig is far too ornate. We could do with a "recommended for
kernel developers" profile. Here's what I'm currently using, though I
know it's changed over time:
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_PM_DEBUG=y
CONFIG_PM_SLEEP_DEBUG=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_BLK_DEBUG_FS=y
CONFIG_PNP_DEBUG_MESSAGES=y
CONFIG_SCSI_DEBUG=m
CONFIG_EXT4_DEBUG=y
CONFIG_JFS_DEBUG=y
CONFIG_XFS_DEBUG=y
CONFIG_BTRFS_DEBUG=y
CONFIG_UFS_DEBUG=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_MISC=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_INFO_DWARF4=y
CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_FS_ALLOW_ALL=y
CONFIG_ARCH_HAS_EARLY_DEBUG=y
CONFIG_SLUB_DEBUG=y
CONFIG_ARCH_HAS_DEBUG_WX=y
CONFIG_HAVE_DEBUG_KMEMLEAK=y
CONFIG_SHRINKER_DEBUG=y
CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
CONFIG_DEBUG_VM_IRQSOFF=y
CONFIG_DEBUG_VM=y
CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_DEBUG_LIST=y
CONFIG_X86_DEBUG_FPU=y
CONFIG_FAULT_INJECTION_DEBUG_FS=y
(output from grep DEBUG .build/.config |grep -v ^#)
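[Editor's note] One low-friction way to apply a set of options like
this is a config fragment merged with the kernel tree's own
merge_config.sh helper. The sketch below is trimmed to a few of the
options above; the fragment filename is arbitrary, and the merge
commands (shown commented out) assume you are in a kernel source tree.

```shell
# Write a debug-options fragment, then merge it into an existing .config.
cat > debug.fragment <<'EOF'
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_LIST=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_LOCK_ALLOC=y
EOF

# In a kernel tree (not run here):
#   ./scripts/kconfig/merge_config.sh -m .config debug.fragment
#   make olddefconfig

# Sanity-check the fragment: count the options it enables.
grep -c '=y$' debug.fragment
```

Keeping the debug options in a fragment makes it easy to reuse the same
set across test kernels without hand-editing each .config.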
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-10-24 20:59 ` Matthew Wilcox
@ 2025-10-24 21:37 ` Darrick J. Wong
2025-10-24 21:58 ` Joanne Koong
1 sibling, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2025-10-24 21:37 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Joanne Koong, Brian Foster, brauner, miklos, hch, hsiangkao,
linux-block, gfs2, linux-fsdevel, kernel-team, linux-xfs,
linux-doc
On Fri, Oct 24, 2025 at 09:59:01PM +0100, Matthew Wilcox wrote:
> On Fri, Oct 24, 2025 at 12:22:32PM -0700, Joanne Koong wrote:
> > > Feels like more filesystem people should be enabling CONFIG_DEBUG_VM
> > > when testing (excluding performance testing of course; it'll do ugly
> > > things to your performance numbers).
> >
> > Point taken. It looks like there's a bunch of other memory debugging
> > configs as well. Do you recommend enabling all of these when testing?
> > Do you have a particular .config you use for when you run tests?
>
> Our Kconfig is far too ornate. We could do with a "recommended for
> kernel developers" profile. Here's what I'm currently using, though I
> know it's changed over time:
Is there any chance you could split the VM debug checks into cheap and
expensive ones, and create another kconfig option so that we could do
the cheap checks without having fstests take a lot longer?
You could also implement shenanigans like the following:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=b739fff870384fd239abfd99ecee6bc47640794d
To enable the expensive checks at runtime.
(Yeah, I know, this is probably a 2 year project + bikeshed score of at
least 30...)
--D
> CONFIG_X86_DEBUGCTLMSR=y
> CONFIG_PM_DEBUG=y
> CONFIG_PM_SLEEP_DEBUG=y
> CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
> CONFIG_BLK_DEBUG_FS=y
> CONFIG_PNP_DEBUG_MESSAGES=y
> CONFIG_SCSI_DEBUG=m
> CONFIG_EXT4_DEBUG=y
> CONFIG_JFS_DEBUG=y
> CONFIG_XFS_DEBUG=y
> CONFIG_BTRFS_DEBUG=y
> CONFIG_UFS_DEBUG=y
> CONFIG_DEBUG_BUGVERBOSE=y
> CONFIG_DEBUG_KERNEL=y
> CONFIG_DEBUG_MISC=y
> CONFIG_DEBUG_INFO=y
> CONFIG_DEBUG_INFO_DWARF4=y
> CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
> CONFIG_DEBUG_FS=y
> CONFIG_DEBUG_FS_ALLOW_ALL=y
> CONFIG_ARCH_HAS_EARLY_DEBUG=y
> CONFIG_SLUB_DEBUG=y
> CONFIG_ARCH_HAS_DEBUG_WX=y
> CONFIG_HAVE_DEBUG_KMEMLEAK=y
> CONFIG_SHRINKER_DEBUG=y
> CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
> CONFIG_DEBUG_VM_IRQSOFF=y
> CONFIG_DEBUG_VM=y
> CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
> CONFIG_DEBUG_MEMORY_INIT=y
> CONFIG_LOCK_DEBUGGING_SUPPORT=y
> CONFIG_DEBUG_RT_MUTEXES=y
> CONFIG_DEBUG_SPINLOCK=y
> CONFIG_DEBUG_MUTEXES=y
> CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
> CONFIG_DEBUG_RWSEMS=y
> CONFIG_DEBUG_LOCK_ALLOC=y
> CONFIG_DEBUG_LIST=y
> CONFIG_X86_DEBUG_FPU=y
> CONFIG_FAULT_INJECTION_DEBUG_FS=y
>
> (output from grep DEBUG .build/.config |grep -v ^#)
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-10-24 19:48 ` Joanne Koong
@ 2025-10-24 21:55 ` Joanne Koong
2025-10-27 12:16 ` Brian Foster
0 siblings, 1 reply; 50+ messages in thread
From: Joanne Koong @ 2025-10-24 21:55 UTC (permalink / raw)
To: Brian Foster
Cc: brauner, miklos, djwong, hch, hsiangkao, linux-block, gfs2,
linux-fsdevel, kernel-team, linux-xfs, linux-doc
On Fri, Oct 24, 2025 at 12:48 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Fri, Oct 24, 2025 at 10:10 AM Brian Foster <bfoster@redhat.com> wrote:
> >
> > On Fri, Oct 24, 2025 at 09:25:13AM -0700, Joanne Koong wrote:
> > > On Thu, Oct 23, 2025 at 5:01 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > On Thu, Oct 23, 2025 at 12:30 PM Brian Foster <bfoster@redhat.com> wrote:
> > > > >
> > > > > On Thu, Sep 25, 2025 at 05:26:02PM -0700, Joanne Koong wrote:
> > > > > > Instead of incrementing read_bytes_pending for every folio range read in
> > > > > > (which requires acquiring the spinlock to do so), set read_bytes_pending
> > > > > > to the folio size when the first range is asynchronously read in, keep
> > > > > > track of how many bytes total are asynchronously read in, and adjust
> > > > > > read_bytes_pending accordingly after issuing requests to read in all the
> > > > > > necessary ranges.
> > > > > >
> > > > > > iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
> > > > > > value for pending bytes necessarily indicates the folio is in the bio.
> > > > > >
> > > > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > > > Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
> > > > > > ---
> > > > >
> > > > > Hi Joanne,
> > > > >
> > > > > I was throwing some extra testing at the vfs-6.19.iomap branch since the
> > > > > little merge conflict thing with iomap_iter_advance(). I end up hitting
> > > > > what appears to be a lockup on XFS with 1k FSB (-bsize=1k) running
> > > > > generic/051. It reproduces fairly reliably within a few iterations or so
> > > > > and seems to always stall during a read for a dedupe operation:
> > > > >
> > > > > task:fsstress state:D stack:0 pid:12094 tgid:12094 ppid:12091 task_flags:0x400140 flags:0x00080003
> > > > > Call Trace:
> > > > > <TASK>
> > > > > __schedule+0x2fc/0x7a0
> > > > > schedule+0x27/0x80
> > > > > io_schedule+0x46/0x70
> > > > > folio_wait_bit_common+0x12b/0x310
> > > > > ? __pfx_wake_page_function+0x10/0x10
> > > > > ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
> > > > > filemap_read_folio+0x85/0xd0
> > > > > ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
> > > > > do_read_cache_folio+0x7c/0x1b0
> > > > > vfs_dedupe_file_range_compare.constprop.0+0xaf/0x2d0
> > > > > __generic_remap_file_range_prep+0x276/0x2a0
> > > > > generic_remap_file_range_prep+0x10/0x20
> > > > > xfs_reflink_remap_prep+0x22c/0x300 [xfs]
> > > > > xfs_file_remap_range+0x84/0x360 [xfs]
> > > > > vfs_dedupe_file_range_one+0x1b2/0x1d0
> > > > > ? remap_verify_area+0x46/0x140
> > > > > vfs_dedupe_file_range+0x162/0x220
> > > > > do_vfs_ioctl+0x4d1/0x940
> > > > > __x64_sys_ioctl+0x75/0xe0
> > > > > do_syscall_64+0x84/0x800
> > > > > ? do_syscall_64+0xbb/0x800
> > > > > ? avc_has_perm_noaudit+0x6b/0xf0
> > > > > ? _copy_to_user+0x31/0x40
> > > > > ? cp_new_stat+0x130/0x170
> > > > > ? __do_sys_newfstat+0x44/0x70
> > > > > ? do_syscall_64+0xbb/0x800
> > > > > ? do_syscall_64+0xbb/0x800
> > > > > ? clear_bhb_loop+0x30/0x80
> > > > > ? clear_bhb_loop+0x30/0x80
> > > > > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > > > RIP: 0033:0x7fe6bbd9a14d
> > > > > RSP: 002b:00007ffde72cd4e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > > > > RAX: ffffffffffffffda RBX: 0000000000000068 RCX: 00007fe6bbd9a14d
> > > > > RDX: 000000000a1394b0 RSI: 00000000c0189436 RDI: 0000000000000004
> > > > > RBP: 00007ffde72cd530 R08: 0000000000001000 R09: 000000000a11a3fc
> > > > > R10: 000000000001d6c0 R11: 0000000000000246 R12: 000000000a12cfb0
> > > > > R13: 000000000a12ba10 R14: 000000000a14e610 R15: 0000000000019000
> > > > > </TASK>
> > > > >
> > > > > It wasn't immediately clear to me what the issue was so I bisected and
> > > > > it landed on this patch. It kind of looks like we're failing to unlock a
> > > > > folio at some point and then tripping over it later..? I can kill the
> > > > > fsstress process but then the umount ultimately gets stuck tossing
> > > > > pagecache [1], so the mount still ends up stuck indefinitely. Anyways,
> > > > > I'll poke at it some more but I figure you might be able to make sense
> > > > > of this faster than I can.
> > > > >
> > > > > Brian
> > > >
> > > > Hi Brian,
> > > >
> > > > Thanks for your report and the repro instructions. I will look into
> > > > this and report back what I find.
> > >
> > > This is the fix:
> > >
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index 4e6258fdb915..aa46fec8362d 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -445,6 +445,9 @@ static void iomap_read_end(struct folio *folio,
> > > size_t bytes_pending)
> > > bool end_read, uptodate;
> > > size_t bytes_accounted = folio_size(folio) - bytes_pending;
> > >
> > > + if (!bytes_accounted)
> > > + return;
> > > +
> > > spin_lock_irq(&ifs->state_lock);
> > >
> > >
> > > What I missed was that if all the bytes in the folio are non-uptodate
> > > and need to be read in by the filesystem, then there's a bug where the
> > > read will be ended on the folio twice (in iomap_read_end() and when
> > > the filesystem calls iomap_finish_folio_read(), when only the
> > > filesystem should end the read). That does two folio unlocks, and the
> > > second unlock ends up locking the folio again. Looking at the
> > > writeback patch that does a similar optimization [1], I missed the
> > > same thing there.
> > >
> >
> > Makes sense.. though a short comment wouldn't hurt in there. ;) I found
> > myself a little confused by the accounted vs. pending naming when
> > reading through that code. If I follow correctly, the intent is to refer
> > to the additional bytes accounted to read_bytes_pending via the init
> > (where it just accounts the whole folio up front) and pending refers to
> > submitted I/O.
> >
> > Presumably that extra accounting doubly serves as the typical "don't
> > complete the op before the submitter is done processing" extra
> > reference, except in this full submit case of course. If so, that's
> > subtle enough in my mind that a sentence or two on it wouldn't hurt..
>
> I will add a comment about this :) That's a good point about the
> naming, maybe "bytes_submitted" and "bytes_unsubmitted" is a lot less
> confusing than "bytes_pending" and "bytes_accounted".
Thinking about this some more, bytes_unsubmitted sounds even more
confusing, so maybe bytes_nonsubmitted or bytes_not_submitted. I'll
think about it some more, but I've kept it as pending/accounted for now.
The fix for this bug is here [1].
Thanks,
Joanne
[1] https://lore.kernel.org/linux-fsdevel/20251024215008.3844068-1-joannelkoong@gmail.com/
>
> Thanks,
> Joanne
>
> >
> > > I'll fix up both. Thanks for catching this and bisecting it down to
> > > this patch. Sorry for the trouble.
> > >
> >
> > No prob. Thanks for the fix!
> >
> > Brian
> >
> > > Thanks,
> > > Joanne
> > >
> > > [1] https://lore.kernel.org/linux-fsdevel/20251009225611.3744728-4-joannelkoong@gmail.com/
> > > >
> > > > Thanks,
> > > > Joanne
> > > > >
> > >
> >
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-10-24 20:59 ` Matthew Wilcox
2025-10-24 21:37 ` Darrick J. Wong
@ 2025-10-24 21:58 ` Joanne Koong
1 sibling, 0 replies; 50+ messages in thread
From: Joanne Koong @ 2025-10-24 21:58 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Brian Foster, brauner, miklos, djwong, hch, hsiangkao,
linux-block, gfs2, linux-fsdevel, kernel-team, linux-xfs,
linux-doc
On Fri, Oct 24, 2025 at 1:59 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Oct 24, 2025 at 12:22:32PM -0700, Joanne Koong wrote:
> > > Feels like more filesystem people should be enabling CONFIG_DEBUG_VM
> > > when testing (excluding performance testing of course; it'll do ugly
> > > things to your performance numbers).
> >
> > Point taken. It looks like there's a bunch of other memory debugging
> > configs as well. Do you recommend enabling all of these when testing?
> > Do you have a particular .config you use for when you run tests?
>
> Our Kconfig is far too ornate. We could do with a "recommended for
> kernel developers" profile. Here's what I'm currently using, though I
> know it's changed over time:
>
> CONFIG_X86_DEBUGCTLMSR=y
> CONFIG_PM_DEBUG=y
> CONFIG_PM_SLEEP_DEBUG=y
> CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
> CONFIG_BLK_DEBUG_FS=y
> CONFIG_PNP_DEBUG_MESSAGES=y
> CONFIG_SCSI_DEBUG=m
> CONFIG_EXT4_DEBUG=y
> CONFIG_JFS_DEBUG=y
> CONFIG_XFS_DEBUG=y
> CONFIG_BTRFS_DEBUG=y
> CONFIG_UFS_DEBUG=y
> CONFIG_DEBUG_BUGVERBOSE=y
> CONFIG_DEBUG_KERNEL=y
> CONFIG_DEBUG_MISC=y
> CONFIG_DEBUG_INFO=y
> CONFIG_DEBUG_INFO_DWARF4=y
> CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
> CONFIG_DEBUG_FS=y
> CONFIG_DEBUG_FS_ALLOW_ALL=y
> CONFIG_ARCH_HAS_EARLY_DEBUG=y
> CONFIG_SLUB_DEBUG=y
> CONFIG_ARCH_HAS_DEBUG_WX=y
> CONFIG_HAVE_DEBUG_KMEMLEAK=y
> CONFIG_SHRINKER_DEBUG=y
> CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
> CONFIG_DEBUG_VM_IRQSOFF=y
> CONFIG_DEBUG_VM=y
> CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
> CONFIG_DEBUG_MEMORY_INIT=y
> CONFIG_LOCK_DEBUGGING_SUPPORT=y
> CONFIG_DEBUG_RT_MUTEXES=y
> CONFIG_DEBUG_SPINLOCK=y
> CONFIG_DEBUG_MUTEXES=y
> CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
> CONFIG_DEBUG_RWSEMS=y
> CONFIG_DEBUG_LOCK_ALLOC=y
> CONFIG_DEBUG_LIST=y
> CONFIG_X86_DEBUG_FPU=y
> CONFIG_FAULT_INJECTION_DEBUG_FS=y
>
> (output from grep DEBUG .build/.config |grep -v ^#)
Thank you, I'll copy this.
>
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-10-24 21:55 ` Joanne Koong
@ 2025-10-27 12:16 ` Brian Foster
0 siblings, 0 replies; 50+ messages in thread
From: Brian Foster @ 2025-10-27 12:16 UTC (permalink / raw)
To: Joanne Koong
Cc: brauner, miklos, djwong, hch, hsiangkao, linux-block, gfs2,
linux-fsdevel, kernel-team, linux-xfs, linux-doc
On Fri, Oct 24, 2025 at 02:55:20PM -0700, Joanne Koong wrote:
> On Fri, Oct 24, 2025 at 12:48 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Fri, Oct 24, 2025 at 10:10 AM Brian Foster <bfoster@redhat.com> wrote:
> > >
> > > On Fri, Oct 24, 2025 at 09:25:13AM -0700, Joanne Koong wrote:
> > > > On Thu, Oct 23, 2025 at 5:01 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > >
> > > > > On Thu, Oct 23, 2025 at 12:30 PM Brian Foster <bfoster@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, Sep 25, 2025 at 05:26:02PM -0700, Joanne Koong wrote:
> > > > > > > Instead of incrementing read_bytes_pending for every folio range read in
> > > > > > > (which requires acquiring the spinlock to do so), set read_bytes_pending
> > > > > > > to the folio size when the first range is asynchronously read in, keep
> > > > > > > track of how many bytes total are asynchronously read in, and adjust
> > > > > > > read_bytes_pending accordingly after issuing requests to read in all the
> > > > > > > necessary ranges.
> > > > > > >
> > > > > > > iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
> > > > > > > value for pending bytes necessarily indicates the folio is in the bio.
> > > > > > >
> > > > > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > > > > Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
> > > > > > > ---
> > > > > >
> > > > > > Hi Joanne,
> > > > > >
> > > > > > I was throwing some extra testing at the vfs-6.19.iomap branch since the
> > > > > > little merge conflict thing with iomap_iter_advance(). I end up hitting
> > > > > > what appears to be a lockup on XFS with 1k FSB (-bsize=1k) running
> > > > > > generic/051. It reproduces fairly reliably within a few iterations or so
> > > > > > and seems to always stall during a read for a dedupe operation:
> > > > > >
> > > > > > task:fsstress state:D stack:0 pid:12094 tgid:12094 ppid:12091 task_flags:0x400140 flags:0x00080003
> > > > > > Call Trace:
> > > > > > <TASK>
> > > > > > __schedule+0x2fc/0x7a0
> > > > > > schedule+0x27/0x80
> > > > > > io_schedule+0x46/0x70
> > > > > > folio_wait_bit_common+0x12b/0x310
> > > > > > ? __pfx_wake_page_function+0x10/0x10
> > > > > > ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
> > > > > > filemap_read_folio+0x85/0xd0
> > > > > > ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
> > > > > > do_read_cache_folio+0x7c/0x1b0
> > > > > > vfs_dedupe_file_range_compare.constprop.0+0xaf/0x2d0
> > > > > > __generic_remap_file_range_prep+0x276/0x2a0
> > > > > > generic_remap_file_range_prep+0x10/0x20
> > > > > > xfs_reflink_remap_prep+0x22c/0x300 [xfs]
> > > > > > xfs_file_remap_range+0x84/0x360 [xfs]
> > > > > > vfs_dedupe_file_range_one+0x1b2/0x1d0
> > > > > > ? remap_verify_area+0x46/0x140
> > > > > > vfs_dedupe_file_range+0x162/0x220
> > > > > > do_vfs_ioctl+0x4d1/0x940
> > > > > > __x64_sys_ioctl+0x75/0xe0
> > > > > > do_syscall_64+0x84/0x800
> > > > > > ? do_syscall_64+0xbb/0x800
> > > > > > ? avc_has_perm_noaudit+0x6b/0xf0
> > > > > > ? _copy_to_user+0x31/0x40
> > > > > > ? cp_new_stat+0x130/0x170
> > > > > > ? __do_sys_newfstat+0x44/0x70
> > > > > > ? do_syscall_64+0xbb/0x800
> > > > > > ? do_syscall_64+0xbb/0x800
> > > > > > ? clear_bhb_loop+0x30/0x80
> > > > > > ? clear_bhb_loop+0x30/0x80
> > > > > > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > > > > RIP: 0033:0x7fe6bbd9a14d
> > > > > > RSP: 002b:00007ffde72cd4e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > > > > > RAX: ffffffffffffffda RBX: 0000000000000068 RCX: 00007fe6bbd9a14d
> > > > > > RDX: 000000000a1394b0 RSI: 00000000c0189436 RDI: 0000000000000004
> > > > > > RBP: 00007ffde72cd530 R08: 0000000000001000 R09: 000000000a11a3fc
> > > > > > R10: 000000000001d6c0 R11: 0000000000000246 R12: 000000000a12cfb0
> > > > > > R13: 000000000a12ba10 R14: 000000000a14e610 R15: 0000000000019000
> > > > > > </TASK>
> > > > > >
> > > > > > It wasn't immediately clear to me what the issue was so I bisected and
> > > > > > it landed on this patch. It kind of looks like we're failing to unlock a
> > > > > > folio at some point and then tripping over it later..? I can kill the
> > > > > > fsstress process but then the umount ultimately gets stuck tossing
> > > > > > pagecache [1], so the mount still ends up stuck indefinitely. Anyways,
> > > > > > I'll poke at it some more but I figure you might be able to make sense
> > > > > > of this faster than I can.
> > > > > >
> > > > > > Brian
> > > > >
> > > > > Hi Brian,
> > > > >
> > > > > Thanks for your report and the repro instructions. I will look into
> > > > > this and report back what I find.
> > > >
> > > > This is the fix:
> > > >
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index 4e6258fdb915..aa46fec8362d 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -445,6 +445,9 @@ static void iomap_read_end(struct folio *folio,
> > > > size_t bytes_pending)
> > > > bool end_read, uptodate;
> > > > size_t bytes_accounted = folio_size(folio) - bytes_pending;
> > > >
> > > > + if (!bytes_accounted)
> > > > + return;
> > > > +
> > > > spin_lock_irq(&ifs->state_lock);
> > > >
> > > >
> > > > What I missed was that if all the bytes in the folio are non-uptodate
> > > > and need to be read in by the filesystem, then there's a bug where the
> > > > read will be ended on the folio twice (in iomap_read_end() and when
> > > > the filesystem calls iomap_finish_folio_read(), when only the
> > > > filesystem should end the read). That does two folio unlocks, and the
> > > > second unlock ends up locking the folio again. Looking at the
> > > > writeback patch that does a similar optimization [1], I missed the
> > > > same thing there.
> > > >
> > >
> > > Makes sense.. though a short comment wouldn't hurt in there. ;) I found
> > > myself a little confused by the accounted vs. pending naming when
> > > reading through that code. If I follow correctly, the intent is to refer
> > > to the additional bytes accounted to read_bytes_pending via the init
> > > (where it just accounts the whole folio up front) and pending refers to
> > > submitted I/O.
> > >
> > > Presumably that extra accounting doubly serves as the typical "don't
> > > complete the op before the submitter is done processing" extra
> > > reference, except in this full submit case of course. If so, that's
> > > subtle enough in my mind that a sentence or two on it wouldn't hurt..
> >
> > I will add a comment about this :) That's a good point about the
> > naming, maybe "bytes_submitted" and "bytes_unsubmitted" is a lot less
> > confusing than "bytes_pending" and "bytes_accounted".
>
> Thinking about this some more, bytes_unsubmitted sounds even more
> confusing, so maybe bytes_nonsubmitted or bytes_not_submitted. I'll
> think about it some more, but I've kept it as pending/accounted for now.
>
bytes_submitted sounds better than pending to me, not sure about
unsubmitted or whatever. As long as there's a sentence or two that
explains what accounted means in the end helper, though, that seems
reasonable enough to me.
Brian
> The fix for this bug is here [1].
>
> Thanks,
> Joanne
>
> [1] https://lore.kernel.org/linux-fsdevel/20251024215008.3844068-1-joannelkoong@gmail.com/
>
> >
> > Thanks,
> > Joanne
> >
> > >
> > > > I'll fix up both. Thanks for catching this and bisecting it down to
> > > > this patch. Sorry for the trouble.
> > > >
> > >
> > > No prob. Thanks for the fix!
> > >
> > > Brian
> > >
> > > > Thanks,
> > > > Joanne
> > > >
> > > > [1] https://lore.kernel.org/linux-fsdevel/20251009225611.3744728-4-joannelkoong@gmail.com/
> > > > >
> > > > > Thanks,
> > > > > Joanne
> > > > > >
> > > >
> > >
>
* [RFC PATCH 0/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2025-09-26 0:26 ` [PATCH v5 12/14] fuse: use iomap for read_folio Joanne Koong
@ 2025-12-23 22:30 ` Sasha Levin
2025-12-23 22:30 ` [RFC PATCH 1/1] " Sasha Levin
0 siblings, 1 reply; 50+ messages in thread
From: Sasha Levin @ 2025-12-23 22:30 UTC (permalink / raw)
To: joannelkoong; +Cc: willy, linux-fsdevel, linux-kernel, Sasha Levin
Hi Joanne,
While testing with your FUSE iomap patchset that recently landed upstream,
I ran into a warning in ifs_free() where the folio's uptodate flag didn't
match the ifs per-block uptodate bitmap. The warning was triggered during
FUSE-based filesystem unmount when running the LTP writev03 test.
After some investigation, I believe the root cause is a race condition
that has existed since commit 7a4847e54cc1 ("iomap: use folio_end_read()")
but was difficult to trigger until now. The issue is that folio_end_read()
uses XOR semantics to set the uptodate bit, so if iomap_set_range_uptodate()
calls folio_mark_uptodate() while a read is in progress, the subsequent
folio_end_read() will XOR and clear the uptodate bit.
The FUSE iomap enablement seems to have created the right conditions to
expose this race - likely due to different file extent patterns in
FUSE-based filesystems (like NTFS-3G) compared to native filesystems
like XFS/ext4.
The fix checks read_bytes_pending under the state_lock in
iomap_set_range_uptodate() and skips calling folio_mark_uptodate() if a
read is in progress, letting the read completion path handle it.
I'm not very familiar with the iomap internals, so I'd really appreciate
your review and feedback on whether this approach is correct.
Thanks,
Sasha
Sasha Levin (1):
iomap: fix race between iomap_set_range_uptodate and folio_end_read
fs/fuse/dev.c | 3 +-
fs/fuse/file.c | 6 ++--
fs/iomap/buffered-io.c | 65 +++++++++++++++++++++++++++++++++++++++---
include/linux/iomap.h | 2 ++
4 files changed, 68 insertions(+), 8 deletions(-)
--
2.51.0
* [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2025-12-23 22:30 ` [RFC PATCH 0/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read Sasha Levin
@ 2025-12-23 22:30 ` Sasha Levin
2025-12-24 1:12 ` Joanne Koong
0 siblings, 1 reply; 50+ messages in thread
From: Sasha Levin @ 2025-12-23 22:30 UTC (permalink / raw)
To: joannelkoong; +Cc: willy, linux-fsdevel, linux-kernel, Sasha Levin
When iomap uses large folios, per-block uptodate tracking is managed via
iomap_folio_state (ifs). A race condition can cause the ifs uptodate bits
to become inconsistent with the folio's uptodate flag.
The race occurs because folio_end_read() uses XOR semantics to atomically
set the uptodate bit and clear the locked bit:
Thread A (read completion): Thread B (concurrent write):
-------------------------------- --------------------------------
iomap_finish_folio_read()
spin_lock(state_lock)
ifs_set_range_uptodate() -> true
spin_unlock(state_lock)
iomap_set_range_uptodate()
spin_lock(state_lock)
ifs_set_range_uptodate() -> true
spin_unlock(state_lock)
folio_mark_uptodate(folio)
folio_end_read(folio, true)
folio_xor_flags() // XOR CLEARS uptodate!
Result: folio is NOT uptodate, but ifs says all blocks ARE uptodate.
Fix by checking read_bytes_pending in iomap_set_range_uptodate() under the
lock. If a read is in progress, skip calling folio_mark_uptodate() - the
read completion path will handle it via folio_end_read().
The warning was triggered during FUSE-based filesystem (e.g., NTFS-3G)
unmount when the LTP writev03 test was run:
WARNING: fs/iomap/buffered-io.c at ifs_free
Call trace:
ifs_free
iomap_invalidate_folio
truncate_cleanup_folio
truncate_inode_pages_range
truncate_inode_pages_final
fuse_evict_inode
...
fuse_kill_sb_blk
Fixes: 7a4847e54cc1 ("iomap: use folio_end_read()")
Assisted-by: claude-opus-4-5-20251101
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
fs/fuse/dev.c | 3 +-
fs/fuse/file.c | 6 ++--
fs/iomap/buffered-io.c | 65 +++++++++++++++++++++++++++++++++++++++---
include/linux/iomap.h | 2 ++
4 files changed, 68 insertions(+), 8 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 6d59cbc877c6..50e84e913589 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -11,6 +11,7 @@
#include "fuse_dev_i.h"
#include <linux/init.h>
+#include <linux/iomap.h>
#include <linux/module.h>
#include <linux/poll.h>
#include <linux/sched/signal.h>
@@ -1820,7 +1821,7 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size,
if (!folio_test_uptodate(folio) && !err && offset == 0 &&
(nr_bytes == folio_size(folio) || file_size == end)) {
folio_zero_segment(folio, nr_bytes, folio_size(folio));
- folio_mark_uptodate(folio);
+ iomap_set_range_uptodate(folio, 0, folio_size(folio));
}
folio_unlock(folio);
folio_put(folio);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 01bc894e9c2b..3abe38416199 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1216,13 +1216,13 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
struct folio *folio = ap->folios[i];
if (err) {
- folio_clear_uptodate(folio);
+ iomap_clear_folio_uptodate(folio);
} else {
if (count >= folio_size(folio) - offset)
count -= folio_size(folio) - offset;
else {
if (short_write)
- folio_clear_uptodate(folio);
+ iomap_clear_folio_uptodate(folio);
count = 0;
}
offset = 0;
@@ -1305,7 +1305,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
/* If we copied full folio, mark it uptodate */
if (tmp == folio_size(folio))
- folio_mark_uptodate(folio);
+ iomap_set_range_uptodate(folio, 0, folio_size(folio));
if (folio_test_uptodate(folio)) {
folio_unlock(folio);
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e5c1ca440d93..7ceda24cf6a7 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -74,8 +74,7 @@ static bool ifs_set_range_uptodate(struct folio *folio,
return ifs_is_fully_uptodate(folio, ifs);
}
-static void iomap_set_range_uptodate(struct folio *folio, size_t off,
- size_t len)
+void iomap_set_range_uptodate(struct folio *folio, size_t off, size_t len)
{
struct iomap_folio_state *ifs = folio->private;
unsigned long flags;
@@ -87,12 +86,50 @@ static void iomap_set_range_uptodate(struct folio *folio, size_t off,
if (ifs) {
spin_lock_irqsave(&ifs->state_lock, flags);
uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
+ /*
+ * If a read is in progress, we must NOT call folio_mark_uptodate
+ * here. The read completion path (iomap_finish_folio_read or
+ * iomap_read_end) will call folio_end_read() which uses XOR
+ * semantics to set the uptodate bit. If we set it here, the XOR
+ * in folio_end_read() will clear it, leaving the folio not
+ * uptodate while the ifs says all blocks are uptodate.
+ */
+ if (uptodate && ifs->read_bytes_pending)
+ uptodate = false;
spin_unlock_irqrestore(&ifs->state_lock, flags);
}
if (uptodate)
folio_mark_uptodate(folio);
}
+EXPORT_SYMBOL_GPL(iomap_set_range_uptodate);
+
+void iomap_clear_folio_uptodate(struct folio *folio)
+{
+ struct iomap_folio_state *ifs = folio->private;
+
+ if (ifs) {
+ struct inode *inode = folio->mapping->host;
+ unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
+ unsigned long flags;
+
+ spin_lock_irqsave(&ifs->state_lock, flags);
+ /*
+ * If a read is in progress, don't clear the uptodate state.
+ * The read completion path will handle the folio state, and
+ * clearing here would race with iomap_finish_folio_read()
+ * potentially causing ifs/folio uptodate state mismatch.
+ */
+ if (ifs->read_bytes_pending) {
+ spin_unlock_irqrestore(&ifs->state_lock, flags);
+ return;
+ }
+ bitmap_clear(ifs->state, 0, nr_blocks);
+ spin_unlock_irqrestore(&ifs->state_lock, flags);
+ }
+ folio_clear_uptodate(folio);
+}
+EXPORT_SYMBOL_GPL(iomap_clear_folio_uptodate);
/*
* Find the next dirty block in the folio. end_blk is inclusive.
@@ -399,8 +436,17 @@ void iomap_finish_folio_read(struct folio *folio, size_t off, size_t len,
spin_unlock_irqrestore(&ifs->state_lock, flags);
}
- if (finished)
+ if (finished) {
+ /*
+ * If uptodate is true but the folio is already marked uptodate,
+ * folio_end_read's XOR semantics would clear the uptodate bit.
+ * This should never happen because iomap_set_range_uptodate()
+ * skips calling folio_mark_uptodate() when read_bytes_pending
+ * is non-zero, ensuring only the read completion path sets it.
+ */
+ WARN_ON_ONCE(uptodate && folio_test_uptodate(folio));
folio_end_read(folio, uptodate);
+ }
}
EXPORT_SYMBOL_GPL(iomap_finish_folio_read);
@@ -481,8 +527,19 @@ static void iomap_read_end(struct folio *folio, size_t bytes_submitted)
if (end_read)
uptodate = ifs_is_fully_uptodate(folio, ifs);
spin_unlock_irq(&ifs->state_lock);
- if (end_read)
+ if (end_read) {
+ /*
+ * If uptodate is true but the folio is already marked
+ * uptodate, folio_end_read's XOR semantics would clear
+ * the uptodate bit. This should never happen because
+ * iomap_set_range_uptodate() skips calling
+ * folio_mark_uptodate() when read_bytes_pending is
+ * non-zero, ensuring only the read completion path
+ * sets it.
+ */
+ WARN_ON_ONCE(uptodate && folio_test_uptodate(folio));
folio_end_read(folio, uptodate);
+ }
} else if (!bytes_submitted) {
/*
* If there were no bytes submitted, this means we are
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 520e967cb501..3c2ad88d16b6 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -345,6 +345,8 @@ void iomap_read_folio(const struct iomap_ops *ops,
void iomap_readahead(const struct iomap_ops *ops,
struct iomap_read_folio_ctx *ctx);
bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
+void iomap_set_range_uptodate(struct folio *folio, size_t off, size_t len);
+void iomap_clear_folio_uptodate(struct folio *folio);
struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
--
2.51.0
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2025-12-23 22:30 ` [RFC PATCH 1/1] " Sasha Levin
@ 2025-12-24 1:12 ` Joanne Koong
2025-12-24 1:31 ` Sasha Levin
2025-12-24 2:10 ` Matthew Wilcox
0 siblings, 2 replies; 50+ messages in thread
From: Joanne Koong @ 2025-12-24 1:12 UTC (permalink / raw)
To: Sasha Levin; +Cc: willy, linux-fsdevel, linux-kernel
On Tue, Dec 23, 2025 at 2:30 PM Sasha Levin <sashal@kernel.org> wrote:
>
Hi Sasha,
Thanks for your patch and for the detailed writeup.
> When iomap uses large folios, per-block uptodate tracking is managed via
> iomap_folio_state (ifs). A race condition can cause the ifs uptodate bits
> to become inconsistent with the folio's uptodate flag.
>
> The race occurs because folio_end_read() uses XOR semantics to atomically
> set the uptodate bit and clear the locked bit:
>
> Thread A (read completion): Thread B (concurrent write):
> -------------------------------- --------------------------------
> iomap_finish_folio_read()
> spin_lock(state_lock)
> ifs_set_range_uptodate() -> true
> spin_unlock(state_lock)
> iomap_set_range_uptodate()
> spin_lock(state_lock)
> ifs_set_range_uptodate() -> true
> spin_unlock(state_lock)
> folio_mark_uptodate(folio)
> folio_end_read(folio, true)
> folio_xor_flags() // XOR CLEARS uptodate!
The part I'm confused about here is how this can happen between a
concurrent read and write. My understanding is that the folio is
locked while the read occurs and while the write occurs, and the lock
is only dropped once the read or write finishes. Looking at the iomap
code, I see iomap_set_range_uptodate() getting called in
__iomap_write_begin() and __iomap_write_end() for writes, but in both
of those places the folio lock is held while it is called. I'm not
seeing how the read/write race in the diagram can happen, but maybe
I'm missing something here?
>
> Result: folio is NOT uptodate, but ifs says all blocks ARE uptodate.
Ah I see the WARN_ON_ONCE() in ifs_free:
WARN_ON_ONCE(ifs_is_fully_uptodate(folio, ifs) !=
folio_test_uptodate(folio));
Just to confirm, are you seeing that the folio is not marked uptodate
but the ifs blocks are? Or are the ifs blocks not uptodate but the
folio is?
>
> Fix by checking read_bytes_pending in iomap_set_range_uptodate() under the
> lock. If a read is in progress, skip calling folio_mark_uptodate() - the
> read completion path will handle it via folio_end_read().
>
> The warning was triggered during FUSE-based filesystem (e.g., NTFS-3G)
> unmount when the LTP writev03 test was run:
>
> WARNING: fs/iomap/buffered-io.c at ifs_free
> Call trace:
> ifs_free
> iomap_invalidate_folio
> truncate_cleanup_folio
> truncate_inode_pages_range
> truncate_inode_pages_final
> fuse_evict_inode
> ...
> fuse_kill_sb_blk
>
> Fixes: 7a4847e54cc1 ("iomap: use folio_end_read()")
> Assisted-by: claude-opus-4-5-20251101
> Signed-off-by: Sasha Levin <sashal@kernel.org>
> ---
> fs/fuse/dev.c | 3 +-
> fs/fuse/file.c | 6 ++--
> fs/iomap/buffered-io.c | 65 +++++++++++++++++++++++++++++++++++++++---
> include/linux/iomap.h | 2 ++
> 4 files changed, 68 insertions(+), 8 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 6d59cbc877c6..50e84e913589 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -11,6 +11,7 @@
> #include "fuse_dev_i.h"
>
> #include <linux/init.h>
> +#include <linux/iomap.h>
> #include <linux/module.h>
> #include <linux/poll.h>
> #include <linux/sched/signal.h>
> @@ -1820,7 +1821,7 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size,
> if (!folio_test_uptodate(folio) && !err && offset == 0 &&
> (nr_bytes == folio_size(folio) || file_size == end)) {
> folio_zero_segment(folio, nr_bytes, folio_size(folio));
> - folio_mark_uptodate(folio);
> + iomap_set_range_uptodate(folio, 0, folio_size(folio));
> }
> folio_unlock(folio);
> folio_put(folio);
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 01bc894e9c2b..3abe38416199 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1216,13 +1216,13 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
> struct folio *folio = ap->folios[i];
>
> if (err) {
> - folio_clear_uptodate(folio);
> + iomap_clear_folio_uptodate(folio);
> } else {
> if (count >= folio_size(folio) - offset)
> count -= folio_size(folio) - offset;
> else {
> if (short_write)
> - folio_clear_uptodate(folio);
> + iomap_clear_folio_uptodate(folio);
> count = 0;
> }
> offset = 0;
> @@ -1305,7 +1305,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
>
> /* If we copied full folio, mark it uptodate */
> if (tmp == folio_size(folio))
> - folio_mark_uptodate(folio);
> + iomap_set_range_uptodate(folio, 0, folio_size(folio));
>
> if (folio_test_uptodate(folio)) {
> folio_unlock(folio);
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index e5c1ca440d93..7ceda24cf6a7 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -74,8 +74,7 @@ static bool ifs_set_range_uptodate(struct folio *folio,
> return ifs_is_fully_uptodate(folio, ifs);
> }
>
> -static void iomap_set_range_uptodate(struct folio *folio, size_t off,
> - size_t len)
> +void iomap_set_range_uptodate(struct folio *folio, size_t off, size_t len)
> {
> struct iomap_folio_state *ifs = folio->private;
> unsigned long flags;
> @@ -87,12 +86,50 @@ static void iomap_set_range_uptodate(struct folio *folio, size_t off,
> if (ifs) {
> spin_lock_irqsave(&ifs->state_lock, flags);
> uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
> + /*
> + * If a read is in progress, we must NOT call folio_mark_uptodate
> + * here. The read completion path (iomap_finish_folio_read or
> + * iomap_read_end) will call folio_end_read() which uses XOR
> + * semantics to set the uptodate bit. If we set it here, the XOR
> + * in folio_end_read() will clear it, leaving the folio not
> + * uptodate while the ifs says all blocks are uptodate.
> + */
> + if (uptodate && ifs->read_bytes_pending)
> + uptodate = false;
Does the warning you saw in ifs_free() go away even without the
changes here to iomap_set_range_uptodate(), or is this change
necessary? I'm asking mostly because I'm not seeing how
iomap_set_range_uptodate() can be called while the read is in
progress, as the logic should already be protected by the folio lock.
> spin_unlock_irqrestore(&ifs->state_lock, flags);
> }
>
> if (uptodate)
> folio_mark_uptodate(folio);
> }
> +EXPORT_SYMBOL_GPL(iomap_set_range_uptodate);
> +
> +void iomap_clear_folio_uptodate(struct folio *folio)
> +{
> + struct iomap_folio_state *ifs = folio->private;
> +
> + if (ifs) {
> + struct inode *inode = folio->mapping->host;
> + unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> + unsigned long flags;
> +
> + spin_lock_irqsave(&ifs->state_lock, flags);
> + /*
> + * If a read is in progress, don't clear the uptodate state.
> + * The read completion path will handle the folio state, and
> + * clearing here would race with iomap_finish_folio_read()
> + * potentially causing ifs/folio uptodate state mismatch.
> + */
> + if (ifs->read_bytes_pending) {
> + spin_unlock_irqrestore(&ifs->state_lock, flags);
> + return;
> + }
> + bitmap_clear(ifs->state, 0, nr_blocks);
> + spin_unlock_irqrestore(&ifs->state_lock, flags);
> + }
> + folio_clear_uptodate(folio);
> +}
> +EXPORT_SYMBOL_GPL(iomap_clear_folio_uptodate);
>
> /*
> * Find the next dirty block in the folio. end_blk is inclusive.
> @@ -399,8 +436,17 @@ void iomap_finish_folio_read(struct folio *folio, size_t off, size_t len,
> spin_unlock_irqrestore(&ifs->state_lock, flags);
> }
>
> - if (finished)
> + if (finished) {
> + /*
> + * If uptodate is true but the folio is already marked uptodate,
> + * folio_end_read's XOR semantics would clear the uptodate bit.
> + * This should never happen because iomap_set_range_uptodate()
> + * skips calling folio_mark_uptodate() when read_bytes_pending
> + * is non-zero, ensuring only the read completion path sets it.
> + */
> + WARN_ON_ONCE(uptodate && folio_test_uptodate(folio));
Matthew pointed out in another thread [1] that folio_end_read() already
has built-in warnings against double-unlocks and double-uptodates:
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(success && folio_test_uptodate(folio), folio);
but imo the WARN_ON_ONCE() here is nice to have too, as I don't think
most builds enable CONFIG_DEBUG_VM.
[1] https://lore.kernel.org/linux-fsdevel/aPu1ilw6Tq6tKPrf@casper.infradead.org/
Thanks,
Joanne
> folio_end_read(folio, uptodate);
> + }
> }
> EXPORT_SYMBOL_GPL(iomap_finish_folio_read);
>
> @@ -481,8 +527,19 @@ static void iomap_read_end(struct folio *folio, size_t bytes_submitted)
> if (end_read)
> uptodate = ifs_is_fully_uptodate(folio, ifs);
> spin_unlock_irq(&ifs->state_lock);
> - if (end_read)
> + if (end_read) {
> + /*
> + * If uptodate is true but the folio is already marked
> + * uptodate, folio_end_read's XOR semantics would clear
> + * the uptodate bit. This should never happen because
> + * iomap_set_range_uptodate() skips calling
> + * folio_mark_uptodate() when read_bytes_pending is
> + * non-zero, ensuring only the read completion path
> + * sets it.
> + */
> + WARN_ON_ONCE(uptodate && folio_test_uptodate(folio));
> folio_end_read(folio, uptodate);
> + }
> } else if (!bytes_submitted) {
> /*
> * If there were no bytes submitted, this means we are
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 520e967cb501..3c2ad88d16b6 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -345,6 +345,8 @@ void iomap_read_folio(const struct iomap_ops *ops,
> void iomap_readahead(const struct iomap_ops *ops,
> struct iomap_read_folio_ctx *ctx);
> bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
> +void iomap_set_range_uptodate(struct folio *folio, size_t off, size_t len);
> +void iomap_clear_folio_uptodate(struct folio *folio);
> struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
> bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
> void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
> --
> 2.51.0
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2025-12-24 1:12 ` Joanne Koong
@ 2025-12-24 1:31 ` Sasha Levin
2026-02-07 7:16 ` Wei Gao
2025-12-24 2:10 ` Matthew Wilcox
1 sibling, 1 reply; 50+ messages in thread
From: Sasha Levin @ 2025-12-24 1:31 UTC (permalink / raw)
To: Joanne Koong; +Cc: willy, linux-fsdevel, linux-kernel
On Tue, Dec 23, 2025 at 05:12:09PM -0800, Joanne Koong wrote:
>On Tue, Dec 23, 2025 at 2:30 PM Sasha Levin <sashal@kernel.org> wrote:
>>
>
>Hi Sasha,
>
>Thanks for your patch and for the detailed writeup.
Thanks for looking into this!
>> When iomap uses large folios, per-block uptodate tracking is managed via
>> iomap_folio_state (ifs). A race condition can cause the ifs uptodate bits
>> to become inconsistent with the folio's uptodate flag.
>>
>> The race occurs because folio_end_read() uses XOR semantics to atomically
>> set the uptodate bit and clear the locked bit:
>>
>> Thread A (read completion): Thread B (concurrent write):
>> -------------------------------- --------------------------------
>> iomap_finish_folio_read()
>> spin_lock(state_lock)
>> ifs_set_range_uptodate() -> true
>> spin_unlock(state_lock)
>> iomap_set_range_uptodate()
>> spin_lock(state_lock)
>> ifs_set_range_uptodate() -> true
>> spin_unlock(state_lock)
>> folio_mark_uptodate(folio)
>> folio_end_read(folio, true)
>> folio_xor_flags() // XOR CLEARS uptodate!
>
>The part I'm confused about here is how this can happen between a
>concurrent read and write. My understanding is that the folio is
>locked when the read occurs and locked when the write occurs and both
>locks get dropped only when the read or write finishes. Looking at
>iomap code, I see iomap_set_range_uptodate() getting called in
>__iomap_write_begin() and __iomap_write_end() for the writes, but in
>both those places the folio lock is held while this is called. I'm not
>seeing how the read and write race in the diagram can happen, but
>maybe I'm missing something here?
Hmm, you're right... The folio lock should prevent concurrent read/write
access. Looking at this again, I suspect that FUSE was calling
folio_clear_uptodate() and folio_mark_uptodate() directly without updating the
ifs bits. For example, in fuse_send_write_pages() on write error, it calls
folio_clear_uptodate(folio) which clears the folio flag but leaves ifs still
showing all blocks uptodate?
>>
>> Result: folio is NOT uptodate, but ifs says all blocks ARE uptodate.
>
>Ah I see the WARN_ON_ONCE() in ifs_free:
> WARN_ON_ONCE(ifs_is_fully_uptodate(folio, ifs) !=
> folio_test_uptodate(folio));
>
>Just to confirm, are you seeing that the folio is not marked uptodate
>but the ifs blocks are? Or are the ifs blocks not uptodate but the
>folio is?
The former: folio is NOT uptodate but ifs shows all blocks ARE uptodate
(state=0xffff with 16 blocks)
>>
>> Fix by checking read_bytes_pending in iomap_set_range_uptodate() under the
>> lock. If a read is in progress, skip calling folio_mark_uptodate() - the
>> read completion path will handle it via folio_end_read().
>>
>> The warning was triggered during FUSE-based filesystem (e.g., NTFS-3G)
>> unmount when the LTP writev03 test was run:
>>
>> WARNING: fs/iomap/buffered-io.c at ifs_free
>> Call trace:
>> ifs_free
>> iomap_invalidate_folio
>> truncate_cleanup_folio
>> truncate_inode_pages_range
>> truncate_inode_pages_final
>> fuse_evict_inode
>> ...
>> fuse_kill_sb_blk
>>
>> Fixes: 7a4847e54cc1 ("iomap: use folio_end_read()")
>> Assisted-by: claude-opus-4-5-20251101
>> Signed-off-by: Sasha Levin <sashal@kernel.org>
>> ---
>> fs/fuse/dev.c | 3 +-
>> fs/fuse/file.c | 6 ++--
>> fs/iomap/buffered-io.c | 65 +++++++++++++++++++++++++++++++++++++++---
>> include/linux/iomap.h | 2 ++
>> 4 files changed, 68 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index 6d59cbc877c6..50e84e913589 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -11,6 +11,7 @@
>> #include "fuse_dev_i.h"
>>
>> #include <linux/init.h>
>> +#include <linux/iomap.h>
>> #include <linux/module.h>
>> #include <linux/poll.h>
>> #include <linux/sched/signal.h>
>> @@ -1820,7 +1821,7 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size,
>> if (!folio_test_uptodate(folio) && !err && offset == 0 &&
>> (nr_bytes == folio_size(folio) || file_size == end)) {
>> folio_zero_segment(folio, nr_bytes, folio_size(folio));
>> - folio_mark_uptodate(folio);
>> + iomap_set_range_uptodate(folio, 0, folio_size(folio));
>> }
>> folio_unlock(folio);
>> folio_put(folio);
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index 01bc894e9c2b..3abe38416199 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -1216,13 +1216,13 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
>> struct folio *folio = ap->folios[i];
>>
>> if (err) {
>> - folio_clear_uptodate(folio);
>> + iomap_clear_folio_uptodate(folio);
>> } else {
>> if (count >= folio_size(folio) - offset)
>> count -= folio_size(folio) - offset;
>> else {
>> if (short_write)
>> - folio_clear_uptodate(folio);
>> + iomap_clear_folio_uptodate(folio);
>> count = 0;
>> }
>> offset = 0;
>> @@ -1305,7 +1305,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
>>
>> /* If we copied full folio, mark it uptodate */
>> if (tmp == folio_size(folio))
>> - folio_mark_uptodate(folio);
>> + iomap_set_range_uptodate(folio, 0, folio_size(folio));
>>
>> if (folio_test_uptodate(folio)) {
>> folio_unlock(folio);
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index e5c1ca440d93..7ceda24cf6a7 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -74,8 +74,7 @@ static bool ifs_set_range_uptodate(struct folio *folio,
>> return ifs_is_fully_uptodate(folio, ifs);
>> }
>>
>> -static void iomap_set_range_uptodate(struct folio *folio, size_t off,
>> - size_t len)
>> +void iomap_set_range_uptodate(struct folio *folio, size_t off, size_t len)
>> {
>> struct iomap_folio_state *ifs = folio->private;
>> unsigned long flags;
>> @@ -87,12 +86,50 @@ static void iomap_set_range_uptodate(struct folio *folio, size_t off,
>> if (ifs) {
>> spin_lock_irqsave(&ifs->state_lock, flags);
>> uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
>> + /*
>> + * If a read is in progress, we must NOT call folio_mark_uptodate
>> + * here. The read completion path (iomap_finish_folio_read or
>> + * iomap_read_end) will call folio_end_read() which uses XOR
>> + * semantics to set the uptodate bit. If we set it here, the XOR
>> + * in folio_end_read() will clear it, leaving the folio not
>> + * uptodate while the ifs says all blocks are uptodate.
>> + */
>> + if (uptodate && ifs->read_bytes_pending)
>> + uptodate = false;
>
>Does the warning you saw in ifs_free() still go away without the
>changes here to iomap_set_range_uptodate() or is this change here
>necessary? I'm asking mostly because I'm not seeing how
>iomap_set_range_uptodate() can be called while the read is in
>progress, as the logic should already be protected by the folio locks.
Yes, the warning goes away even without this part. I don't think that this is
necessary - I just kept it while figuring out the race.
>> spin_unlock_irqrestore(&ifs->state_lock, flags);
>> }
>>
>> if (uptodate)
>> folio_mark_uptodate(folio);
>> }
>> +EXPORT_SYMBOL_GPL(iomap_set_range_uptodate);
>> +
>> +void iomap_clear_folio_uptodate(struct folio *folio)
>> +{
>> + struct iomap_folio_state *ifs = folio->private;
>> +
>> + if (ifs) {
>> + struct inode *inode = folio->mapping->host;
>> + unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&ifs->state_lock, flags);
>> + /*
>> + * If a read is in progress, don't clear the uptodate state.
>> + * The read completion path will handle the folio state, and
>> + * clearing here would race with iomap_finish_folio_read()
>> + * potentially causing ifs/folio uptodate state mismatch.
>> + */
>> + if (ifs->read_bytes_pending) {
>> + spin_unlock_irqrestore(&ifs->state_lock, flags);
>> + return;
>> + }
>> + bitmap_clear(ifs->state, 0, nr_blocks);
>> + spin_unlock_irqrestore(&ifs->state_lock, flags);
>> + }
>> + folio_clear_uptodate(folio);
>> +}
>> +EXPORT_SYMBOL_GPL(iomap_clear_folio_uptodate);
>>
>> /*
>> * Find the next dirty block in the folio. end_blk is inclusive.
>> @@ -399,8 +436,17 @@ void iomap_finish_folio_read(struct folio *folio, size_t off, size_t len,
>> spin_unlock_irqrestore(&ifs->state_lock, flags);
>> }
>>
>> - if (finished)
>> + if (finished) {
>> + /*
>> + * If uptodate is true but the folio is already marked uptodate,
>> + * folio_end_read's XOR semantics would clear the uptodate bit.
>> + * This should never happen because iomap_set_range_uptodate()
>> + * skips calling folio_mark_uptodate() when read_bytes_pending
>> + * is non-zero, ensuring only the read completion path sets it.
>> + */
>> + WARN_ON_ONCE(uptodate && folio_test_uptodate(folio));
>
>Matthew pointed out in another thread [1] that folio_end_read() already
>has built-in warnings against double-unlocks and double-uptodates:
>
> VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> VM_BUG_ON_FOLIO(success && folio_test_uptodate(folio), folio);
>
>but imo the WARN_ON_ONCE() here is nice to have too, as I don't think
>most builds enable CONFIG_DEBUG_VM.
>
>[1] https://lore.kernel.org/linux-fsdevel/aPu1ilw6Tq6tKPrf@casper.infradead.org/
>
>Thanks,
>Joanne
>> folio_end_read(folio, uptodate);
>> + }
>> }
>> EXPORT_SYMBOL_GPL(iomap_finish_folio_read);
>>
>> @@ -481,8 +527,19 @@ static void iomap_read_end(struct folio *folio, size_t bytes_submitted)
>> if (end_read)
>> uptodate = ifs_is_fully_uptodate(folio, ifs);
>> spin_unlock_irq(&ifs->state_lock);
>> - if (end_read)
>> + if (end_read) {
>> + /*
>> + * If uptodate is true but the folio is already marked
>> + * uptodate, folio_end_read's XOR semantics would clear
>> + * the uptodate bit. This should never happen because
>> + * iomap_set_range_uptodate() skips calling
>> + * folio_mark_uptodate() when read_bytes_pending is
>> + * non-zero, ensuring only the read completion path
>> + * sets it.
>> + */
>> + WARN_ON_ONCE(uptodate && folio_test_uptodate(folio));
>> folio_end_read(folio, uptodate);
>> + }
>> } else if (!bytes_submitted) {
>> /*
>> * If there were no bytes submitted, this means we are
>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>> index 520e967cb501..3c2ad88d16b6 100644
>> --- a/include/linux/iomap.h
>> +++ b/include/linux/iomap.h
>> @@ -345,6 +345,8 @@ void iomap_read_folio(const struct iomap_ops *ops,
>> void iomap_readahead(const struct iomap_ops *ops,
>> struct iomap_read_folio_ctx *ctx);
>> bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
>> +void iomap_set_range_uptodate(struct folio *folio, size_t off, size_t len);
>> +void iomap_clear_folio_uptodate(struct folio *folio);
>> struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
>> bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
>> void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
>> --
>> 2.51.0
>>
--
Thanks,
Sasha
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2025-12-24 1:12 ` Joanne Koong
2025-12-24 1:31 ` Sasha Levin
@ 2025-12-24 2:10 ` Matthew Wilcox
2025-12-24 15:43 ` Sasha Levin
1 sibling, 1 reply; 50+ messages in thread
From: Matthew Wilcox @ 2025-12-24 2:10 UTC (permalink / raw)
To: Joanne Koong; +Cc: Sasha Levin, linux-fsdevel, linux-kernel
On Tue, Dec 23, 2025 at 05:12:09PM -0800, Joanne Koong wrote:
> On Tue, Dec 23, 2025 at 2:30 PM Sasha Levin <sashal@kernel.org> wrote:
> >
>
> Hi Sasha,
>
> Thanks for your patch and for the detailed writeup.
The important line to note is:
Assisted-by: claude-opus-4-5-20251101
So Sasha has produced a very convincingly worded writeup that's
hallucinated.
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2025-12-24 2:10 ` Matthew Wilcox
@ 2025-12-24 15:43 ` Sasha Levin
2025-12-24 17:27 ` Matthew Wilcox
0 siblings, 1 reply; 50+ messages in thread
From: Sasha Levin @ 2025-12-24 15:43 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Joanne Koong, linux-fsdevel, linux-kernel
On Wed, Dec 24, 2025 at 02:10:19AM +0000, Matthew Wilcox wrote:
>On Tue, Dec 23, 2025 at 05:12:09PM -0800, Joanne Koong wrote:
>> On Tue, Dec 23, 2025 at 2:30 PM Sasha Levin <sashal@kernel.org> wrote:
>> >
>>
>> Hi Sasha,
>>
>> Thanks for your patch and for the detailed writeup.
>
>The important line to note is:
>
>Assisted-by: claude-opus-4-5-20251101
>
>So Sasha has produced a very convincingly worded writeup that's
>hallucinated.
And spent a few hours trying to figure it out so I could unblock testing, but
sure - thanks.
Here's the full log:
https://qa-reports.linaro.org/lkft/sashal-linus-next/build/v6.18-rc7-13806-gb927546677c8/testrun/30618654/suite/log-parser-test/test/exception-warning-fsiomapbuffered-io-at-ifs_free/log
, happy to test any patches you might have.
--
Thanks,
Sasha
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2025-12-24 15:43 ` Sasha Levin
@ 2025-12-24 17:27 ` Matthew Wilcox
2025-12-24 21:21 ` Sasha Levin
0 siblings, 1 reply; 50+ messages in thread
From: Matthew Wilcox @ 2025-12-24 17:27 UTC (permalink / raw)
To: Sasha Levin; +Cc: Joanne Koong, linux-fsdevel, linux-kernel
On Wed, Dec 24, 2025 at 10:43:58AM -0500, Sasha Levin wrote:
> On Wed, Dec 24, 2025 at 02:10:19AM +0000, Matthew Wilcox wrote:
> > So Sasha has produced a very convincingly worded writeup that's
> > hallucinated.
>
> And spent a few hours trying to figure it out so I could unblock testing, but
> sure - thanks.
When you produce a convincingly worded writeup that's utterly wrong,
and have a reputation for using AI, that's the kind of reaction you're
going to get.
> Here's the full log:
> https://qa-reports.linaro.org/lkft/sashal-linus-next/build/v6.18-rc7-13806-gb927546677c8/testrun/30618654/suite/log-parser-test/test/exception-warning-fsiomapbuffered-io-at-ifs_free/log
> , happy to test any patches you might have.
That's actually much more helpful because it removes your incorrect
assumptions about what's going on.
WARNING: fs/iomap/buffered-io.c:254 at ifs_free+0x130/0x148, CPU#0: msync04/406
That's this one:
WARN_ON_ONCE(ifs_is_fully_uptodate(folio, ifs) !=
folio_test_uptodate(folio));
which would be fully explained by fuse calling folio_clear_uptodate()
in fuse_send_write_pages(). I have come to believe that allowing
filesystems to call folio_clear_uptodate() is just dangerous. It
causes assertions to fire all over the place (eg if the page is mapped
into memory, the MM contains assertions that it must be uptodate).
So I think the first step is simply to delete the folio_clear_uptodate()
calls in fuse:
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 01bc894e9c2b..b819ede407d5 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1194,7 +1194,6 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
struct fuse_file *ff = file->private_data;
struct fuse_mount *fm = ff->fm;
unsigned int offset, i;
- bool short_write;
int err;
for (i = 0; i < ap->num_folios; i++)
@@ -1209,22 +1208,16 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
if (!err && ia->write.out.size > count)
err = -EIO;
- short_write = ia->write.out.size < count;
offset = ap->descs[0].offset;
count = ia->write.out.size;
for (i = 0; i < ap->num_folios; i++) {
struct folio *folio = ap->folios[i];
- if (err) {
- folio_clear_uptodate(folio);
- } else {
+ if (!err) {
if (count >= folio_size(folio) - offset)
count -= folio_size(folio) - offset;
- else {
- if (short_write)
- folio_clear_uptodate(folio);
+ else
count = 0;
- }
offset = 0;
}
if (ia->write.folio_locked && (i == ap->num_folios - 1))
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2025-12-24 17:27 ` Matthew Wilcox
@ 2025-12-24 21:21 ` Sasha Levin
2025-12-30 0:58 ` Joanne Koong
0 siblings, 1 reply; 50+ messages in thread
From: Sasha Levin @ 2025-12-24 21:21 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Joanne Koong, linux-fsdevel, linux-kernel
On Wed, Dec 24, 2025 at 05:27:03PM +0000, Matthew Wilcox wrote:
>On Wed, Dec 24, 2025 at 10:43:58AM -0500, Sasha Levin wrote:
>> On Wed, Dec 24, 2025 at 02:10:19AM +0000, Matthew Wilcox wrote:
>> > So Sasha has produced a very convincingly worded writeup that's
>> > hallucinated.
>>
>> And spent a few hours trying to figure it out so I could unblock testing, but
>> sure - thanks.
>
>When you produce a convincingly worded writeup that's utterly wrong,
>and have a reputation for using AI, that's the kind of reaction you're
>going to get.
A rude and unprofessional one?
>> Here's the full log:
>> https://qa-reports.linaro.org/lkft/sashal-linus-next/build/v6.18-rc7-13806-gb927546677c8/testrun/30618654/suite/log-parser-test/test/exception-warning-fsiomapbuffered-io-at-ifs_free/log
>> , happy to test any patches you might have.
>
>That's actually much more helpful because it removes your incorrect
>assumptions about what's going on.
>
> WARNING: fs/iomap/buffered-io.c:254 at ifs_free+0x130/0x148, CPU#0: msync04/406
>
>That's this one:
>
> WARN_ON_ONCE(ifs_is_fully_uptodate(folio, ifs) !=
> folio_test_uptodate(folio));
>
>which would be fully explained by fuse calling folio_clear_uptodate()
>in fuse_send_write_pages(). I have come to believe that allowing
>filesystems to call folio_clear_uptodate() is just dangerous. It
>causes assertions to fire all over the place (eg if the page is mapped
>into memory, the MM contains assertions that it must be uptodate).
>
>So I think the first step is simply to delete the folio_clear_uptodate()
>calls in fuse:
[snip]
Here's the log of a run with the change you've provided applied: https://qa-reports.linaro.org/lkft/sashal-linus-next/build/v6.18-rc7-13807-g26a15474eb13/testrun/30620754/suite/log-parser-test/test/exception-warning-fsiomapbuffered-io-at-ifs_free/log
--
Thanks,
Sasha
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2025-12-24 21:21 ` Sasha Levin
@ 2025-12-30 0:58 ` Joanne Koong
0 siblings, 0 replies; 50+ messages in thread
From: Joanne Koong @ 2025-12-30 0:58 UTC (permalink / raw)
To: Sasha Levin; +Cc: Matthew Wilcox, linux-fsdevel, linux-kernel
On Wed, Dec 24, 2025 at 1:21 PM Sasha Levin <sashal@kernel.org> wrote:
>
> On Wed, Dec 24, 2025 at 05:27:03PM +0000, Matthew Wilcox wrote:
> >
> > WARNING: fs/iomap/buffered-io.c:254 at ifs_free+0x130/0x148, CPU#0: msync04/406
> >
> >That's this one:
> >
> > WARN_ON_ONCE(ifs_is_fully_uptodate(folio, ifs) !=
> > folio_test_uptodate(folio));
> >
> >which would be fully explained by fuse calling folio_clear_uptodate()
> >in fuse_send_write_pages(). I have come to believe that allowing
> >filesystems to call folio_clear_uptodate() is just dangerous. It
> >causes assertions to fire all over the place (eg if the page is mapped
> >into memory, the MM contains assertions that it must be uptodate).
> >
> >So I think the first step is simply to delete the folio_clear_uptodate()
> >calls in fuse:
Hmm... this fuse_perform_write() call path is for writethrough. In
writethrough, fuse first writes the data to the page cache and then to
the server. I think because we're doing the writes in that order (eg
first to the page cache, then the server), the clear uptodate is
needed if the server write is a short write or an error, since we
can't revert the page cache data back to its original content. For
example: we want to write 2 KB starting at offset 0, the folio
representing that range in the page cache is uptodate, we retrieve
that folio and write 2 KB to it, but when we try writing it to the
server, the server can only write out 1 KB. Now there's a discrepancy
between the page cache contents and the disk contents for the chunk
between 1 KB and 2 KB, and we're unable to make these consistent by
undoing the page cache write. If we could switch the ordering and
write to the server first and then to the page cache, then we could
get rid of the clear uptodates, but switching this ordering requires a
bigger change where we'd need to add support for copying out data from
a userspace iter to the server (currently, only copying out data from
folios is supported). I'm happy to work on this though if you think we
should try our best to fully eradicate folio_clear_uptodate() from
fuse.
There's also another folio_clear_uptodate() call in
fuse_try_move_folio() in fuse/dev.c when the server gifts pages to the
kernel through vmsplice. This one I think is needed, else
folio_end_read() will XOR the uptodate state of an already-uptodate folio
(commit 76a51ac ("fuse: clear PG_uptodate when using a stolen page")
says a bit more about this).
> [snip]
>
> Here's the log of a run with the change you've provided applied: https://qa-reports.linaro.org/lkft/sashal-linus-next/build/v6.18-rc7-13807-g26a15474eb13/testrun/30620754/suite/log-parser-test/test/exception-warning-fsiomapbuffered-io-at-ifs_free/log
Hmm, I think this WARN_ON_ONCE is getting triggered from the
folio_mark_uptodate() call in fuse_fill_write_pages().
This is happening because iomap integration hasn't (yet) been added to
the fuse writethrough path, as it's not necessary / urgent (whereas
for buffered writes, it is in order for fuse to use large folios). imo
updating the folio uptodate/dirty state but not the bitmap is
logically fine as the worst outcome from this is that we miss being
able to skip some extra read calls that we could saved if we did add
the iomap bitmap integration. However, I didn't realize there's a
WARN_ON_ONCE checking the ifs uptodate bitmap state (but curiously no
WARN_ON_ONCE checking the ifs dirty bitmap state).
With that said, I think it makes sense to either a) do the
iomap_set_range_uptodate() / iomap_clear_folio_uptodate() bitmap
updating you proposed as a fix for this WARN_ON_ONCE for now to
unblock things, until iomap integration gets added to the fuse
writethrough path, which I'll now prioritize, or b) remove that
warning. The warning does seem otherwise useful though so it seems
like we should probably just go with a).
Thanks,
Joanne
>
> --
> Thanks,
> Sasha
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2025-12-24 1:31 ` Sasha Levin
@ 2026-02-07 7:16 ` Wei Gao
2026-02-09 19:08 ` Joanne Koong
0 siblings, 1 reply; 50+ messages in thread
From: Wei Gao @ 2026-02-07 7:16 UTC (permalink / raw)
To: Sasha Levin; +Cc: Joanne Koong, willy, linux-fsdevel, linux-kernel, wegao
On Tue, Dec 23, 2025 at 08:31:57PM -0500, Sasha Levin wrote:
> On Tue, Dec 23, 2025 at 05:12:09PM -0800, Joanne Koong wrote:
> > On Tue, Dec 23, 2025 at 2:30 PM Sasha Levin <sashal@kernel.org> wrote:
> > >
> >
> > Hi Sasha,
> >
> > Thanks for your patch and for the detailed writeup.
>
> Thanks for looking into this!
>
> > > When iomap uses large folios, per-block uptodate tracking is managed via
> > > iomap_folio_state (ifs). A race condition can cause the ifs uptodate bits
> > > to become inconsistent with the folio's uptodate flag.
> > >
> > > The race occurs because folio_end_read() uses XOR semantics to atomically
> > > set the uptodate bit and clear the locked bit:
> > >
> > > Thread A (read completion): Thread B (concurrent write):
> > > -------------------------------- --------------------------------
> > > iomap_finish_folio_read()
> > > spin_lock(state_lock)
> > > ifs_set_range_uptodate() -> true
> > > spin_unlock(state_lock)
> > > iomap_set_range_uptodate()
> > > spin_lock(state_lock)
> > > ifs_set_range_uptodate() -> true
> > > spin_unlock(state_lock)
> > > folio_mark_uptodate(folio)
> > > folio_end_read(folio, true)
> > > folio_xor_flags() // XOR CLEARS uptodate!
> >
> > The part I'm confused about here is how this can happen between a
> > concurrent read and write. My understanding is that the folio is
> > locked when the read occurs and locked when the write occurs and both
> > locks get dropped only when the read or write finishes. Looking at
> > iomap code, I see iomap_set_range_uptodate() getting called in
> > __iomap_write_begin() and __iomap_write_end() for the writes, but in
> > both those places the folio lock is held while this is called. I'm not
> > seeing how the read and write race in the diagram can happen, but
> > maybe I'm missing something here?
>
> Hmm, you're right... The folio lock should prevent concurrent read/write
> access. Looking at this again, I suspect that FUSE was calling
> folio_clear_uptodate() and folio_mark_uptodate() directly without updating the
> ifs bits. For example, in fuse_send_write_pages() on write error, it calls
> folio_clear_uptodate(folio) which clears the folio flag but leaves ifs still
> showing all blocks uptodate?
Hi Sasha
On PowerPC with 64KB page size, msync04 fails with SIGBUS on NTFS-FUSE. The issue stems from a state inconsistency between
the iomap_folio_state (ifs) bitmap and the folio's Uptodate flag.
tst_test.c:1985: TINFO: === Testing on ntfs ===
tst_test.c:1290: TINFO: Formatting /dev/loop0 with ntfs opts='' extra opts=''
Failed to set locale, using default 'C'.
The partition start sector was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
The number of sectors per track was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
The number of heads was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
To boot from a device, Windows needs the 'partition start sector', the 'sectors per track' and the 'number of heads' to be set.
Windows will not be able to boot from this device.
tst_test.c:1302: TINFO: Mounting /dev/loop0 to /tmp/LTP_msy3ljVxi/msync04 fstyp=ntfs flags=0
tst_test.c:1302: TINFO: Trying FUSE...
tst_test.c:1953: TBROK: Test killed by SIGBUS!
Root Cause Analysis: When a page fault triggers fuse_read_folio, the request is
handled by iomap_read_folio_iter. For a 64KB page, after fetching 4KB via
fuse_iomap_read_folio_range_async, the remaining 60KB (61440 bytes) is zero-filled
via iomap_block_needs_zeroing, and iomap_set_range_uptodate then marks the folio
as Uptodate globally. After folio_xor_flags, the folio's uptodate flag becomes 0
again, which finally triggers a SIGBUS in filemap_fault.
So your iomap_set_range_uptodate patch can fix the above failure, since it
prevents that path from marking the folio uptodate while the read is pending.
Hope my findings are helpful.
>
> --
> Thanks,
> Sasha
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2026-02-07 7:16 ` Wei Gao
@ 2026-02-09 19:08 ` Joanne Koong
2026-02-10 0:12 ` Wei Gao
0 siblings, 1 reply; 50+ messages in thread
From: Joanne Koong @ 2026-02-09 19:08 UTC (permalink / raw)
To: Wei Gao; +Cc: Sasha Levin, willy, linux-fsdevel, linux-kernel
On Fri, Feb 6, 2026 at 11:16 PM Wei Gao <wegao@suse.com> wrote:
>
> On Tue, Dec 23, 2025 at 08:31:57PM -0500, Sasha Levin wrote:
> > On Tue, Dec 23, 2025 at 05:12:09PM -0800, Joanne Koong wrote:
> > > On Tue, Dec 23, 2025 at 2:30 PM Sasha Levin <sashal@kernel.org> wrote:
> > > >
> > >
> > > Hi Sasha,
> > >
> > > Thanks for your patch and for the detailed writeup.
> >
> > Thanks for looking into this!
> >
> > > > When iomap uses large folios, per-block uptodate tracking is managed via
> > > > iomap_folio_state (ifs). A race condition can cause the ifs uptodate bits
> > > > to become inconsistent with the folio's uptodate flag.
> > > >
> > > > The race occurs because folio_end_read() uses XOR semantics to atomically
> > > > set the uptodate bit and clear the locked bit:
> > > >
> > > > Thread A (read completion): Thread B (concurrent write):
> > > > -------------------------------- --------------------------------
> > > > iomap_finish_folio_read()
> > > > spin_lock(state_lock)
> > > > ifs_set_range_uptodate() -> true
> > > > spin_unlock(state_lock)
> > > > iomap_set_range_uptodate()
> > > > spin_lock(state_lock)
> > > > ifs_set_range_uptodate() -> true
> > > > spin_unlock(state_lock)
> > > > folio_mark_uptodate(folio)
> > > > folio_end_read(folio, true)
> > > > folio_xor_flags() // XOR CLEARS uptodate!
> > >
> > > The part I'm confused about here is how this can happen between a
> > > concurrent read and write. My understanding is that the folio is
> > > locked when the read occurs and locked when the write occurs and both
> > > locks get dropped only when the read or write finishes. Looking at
> > > iomap code, I see iomap_set_range_uptodate() getting called in
> > > __iomap_write_begin() and __iomap_write_end() for the writes, but in
> > > both those places the folio lock is held while this is called. I'm not
> > > seeing how the read and write race in the diagram can happen, but
> > > maybe I'm missing something here?
> >
> > Hmm, you're right... The folio lock should prevent concurrent read/write
> > access. Looking at this again, I suspect that FUSE was calling
> > folio_clear_uptodate() and folio_mark_uptodate() directly without updating the
> > ifs bits. For example, in fuse_send_write_pages() on write error, it calls
> > folio_clear_uptodate(folio) which clears the folio flag but leaves ifs still
> > showing all blocks uptodate?
>
> Hi Sasha
> On PowerPC with 64KB page size, msync04 fails with SIGBUS on NTFS-FUSE. The issue stems from a state inconsistency between
> the iomap_folio_state (ifs) bitmap and the folio's Uptodate flag.
> tst_test.c:1985: TINFO: === Testing on ntfs ===
> tst_test.c:1290: TINFO: Formatting /dev/loop0 with ntfs opts='' extra opts=''
> Failed to set locale, using default 'C'.
> The partition start sector was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
> The number of sectors per track was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
> The number of heads was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
> To boot from a device, Windows needs the 'partition start sector', the 'sectors per track' and the 'number of heads' to be set.
> Windows will not be able to boot from this device.
> tst_test.c:1302: TINFO: Mounting /dev/loop0 to /tmp/LTP_msy3ljVxi/msync04 fstyp=ntfs flags=0
> tst_test.c:1302: TINFO: Trying FUSE...
> tst_test.c:1953: TBROK: Test killed by SIGBUS!
>
> Root Cause Analysis: When a page fault triggers fuse_read_folio, iomap_read_folio_iter handles the request. For a 64KB page,
> after fetching 4KB via fuse_iomap_read_folio_range_async, the remaining 60KB (61440 bytes) is zero-filled via iomap_block_needs_zeroing
> and iomap_set_range_uptodate marks the folio Uptodate globally; after folio_xor_flags, the folio's uptodate flag becomes 0 again,
> finally triggering a SIGBUS in filemap_fault.
Hi Wei,
Thanks for your report. afaict, this scenario occurs only if the
server is a fuseblk server with a block size different from the memory
page size and if the file size is less than the size of the folio
being read in.
Could you verify that this snippet from Sasha's patch fixes the issue?:
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e5c1ca440d93..7ceda24cf6a7 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -87,12 +86,50 @@ static void iomap_set_range_uptodate(struct folio
*folio, size_t off,
if (ifs) {
spin_lock_irqsave(&ifs->state_lock, flags);
uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
+ /*
+ * If a read is in progress, we must NOT call folio_mark_uptodate
+ * here. The read completion path (iomap_finish_folio_read or
+ * iomap_read_end) will call folio_end_read() which uses XOR
+ * semantics to set the uptodate bit. If we set it here, the XOR
+ * in folio_end_read() will clear it, leaving the folio not
+ * uptodate while the ifs says all blocks are uptodate.
+ */
+ if (uptodate && ifs->read_bytes_pending)
+ uptodate = false;
spin_unlock_irqrestore(&ifs->state_lock, flags);
}
Thanks,
Joanne
>
> So your iomap_set_range_uptodate patch can fix the above failed case since it blocks marking the folio's uptodate flag to 1.
> Hope my findings are helpful.
>
> >
> > --
> > Thanks,
> > Sasha
> >
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2026-02-09 19:08 ` Joanne Koong
@ 2026-02-10 0:12 ` Wei Gao
2026-02-10 0:20 ` Joanne Koong
0 siblings, 1 reply; 50+ messages in thread
From: Wei Gao @ 2026-02-10 0:12 UTC (permalink / raw)
To: Joanne Koong; +Cc: Sasha Levin, willy, linux-fsdevel, linux-kernel
On Mon, Feb 09, 2026 at 11:08:50AM -0800, Joanne Koong wrote:
> On Fri, Feb 6, 2026 at 11:16 PM Wei Gao <wegao@suse.com> wrote:
> >
> > On Tue, Dec 23, 2025 at 08:31:57PM -0500, Sasha Levin wrote:
> > > On Tue, Dec 23, 2025 at 05:12:09PM -0800, Joanne Koong wrote:
> > > > On Tue, Dec 23, 2025 at 2:30 PM Sasha Levin <sashal@kernel.org> wrote:
> > > > >
> > > >
> > > > Hi Sasha,
> > > >
> > > > Thanks for your patch and for the detailed writeup.
> > >
> > > Thanks for looking into this!
> > >
> > > > > When iomap uses large folios, per-block uptodate tracking is managed via
> > > > > iomap_folio_state (ifs). A race condition can cause the ifs uptodate bits
> > > > > to become inconsistent with the folio's uptodate flag.
> > > > >
> > > > > The race occurs because folio_end_read() uses XOR semantics to atomically
> > > > > set the uptodate bit and clear the locked bit:
> > > > >
> > > > > Thread A (read completion): Thread B (concurrent write):
> > > > > -------------------------------- --------------------------------
> > > > > iomap_finish_folio_read()
> > > > > spin_lock(state_lock)
> > > > > ifs_set_range_uptodate() -> true
> > > > > spin_unlock(state_lock)
> > > > > iomap_set_range_uptodate()
> > > > > spin_lock(state_lock)
> > > > > ifs_set_range_uptodate() -> true
> > > > > spin_unlock(state_lock)
> > > > > folio_mark_uptodate(folio)
> > > > > folio_end_read(folio, true)
> > > > > folio_xor_flags() // XOR CLEARS uptodate!
> > > >
> > > > The part I'm confused about here is how this can happen between a
> > > > concurrent read and write. My understanding is that the folio is
> > > > locked when the read occurs and locked when the write occurs and both
> > > > locks get dropped only when the read or write finishes. Looking at
> > > > iomap code, I see iomap_set_range_uptodate() getting called in
> > > > __iomap_write_begin() and __iomap_write_end() for the writes, but in
> > > > both those places the folio lock is held while this is called. I'm not
> > > > seeing how the read and write race in the diagram can happen, but
> > > > maybe I'm missing something here?
> > >
> > > Hmm, you're right... The folio lock should prevent concurrent read/write
> > > access. Looking at this again, I suspect that FUSE was calling
> > > folio_clear_uptodate() and folio_mark_uptodate() directly without updating the
> > > ifs bits. For example, in fuse_send_write_pages() on write error, it calls
> > > folio_clear_uptodate(folio) which clears the folio flag but leaves ifs still
> > > showing all blocks uptodate?
> >
> > Hi Sasha
> > On PowerPC with 64KB page size, msync04 fails with SIGBUS on NTFS-FUSE. The issue stems from a state inconsistency between
> > the iomap_folio_state (ifs) bitmap and the folio's Uptodate flag.
> > tst_test.c:1985: TINFO: === Testing on ntfs ===
> > tst_test.c:1290: TINFO: Formatting /dev/loop0 with ntfs opts='' extra opts=''
> > Failed to set locale, using default 'C'.
> > The partition start sector was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
> > The number of sectors per track was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
> > The number of heads was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
> > To boot from a device, Windows needs the 'partition start sector', the 'sectors per track' and the 'number of heads' to be set.
> > Windows will not be able to boot from this device.
> > tst_test.c:1302: TINFO: Mounting /dev/loop0 to /tmp/LTP_msy3ljVxi/msync04 fstyp=ntfs flags=0
> > tst_test.c:1302: TINFO: Trying FUSE...
> > tst_test.c:1953: TBROK: Test killed by SIGBUS!
> >
> > Root Cause Analysis: When a page fault triggers fuse_read_folio, iomap_read_folio_iter handles the request. For a 64KB page,
> > after fetching 4KB via fuse_iomap_read_folio_range_async, the remaining 60KB (61440 bytes) is zero-filled via iomap_block_needs_zeroing
> > and iomap_set_range_uptodate marks the folio Uptodate globally; after folio_xor_flags, the folio's uptodate flag becomes 0 again,
> > finally triggering a SIGBUS in filemap_fault.
>
> Hi Wei,
>
> Thanks for your report. afaict, this scenario occurs only if the
> server is a fuseblk server with a block size different from the memory
> page size and if the file size is less than the size of the folio
> being read in.
Thanks for checking this and giving quick feedback :)
>
> Could you verify that this snippet from Sasha's patch fixes the issue?:
Yes, Sasha's patch fixes the issue.
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index e5c1ca440d93..7ceda24cf6a7 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -87,12 +86,50 @@ static void iomap_set_range_uptodate(struct folio
> *folio, size_t off,
> if (ifs) {
> spin_lock_irqsave(&ifs->state_lock, flags);
> uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
> + /*
> + * If a read is in progress, we must NOT call folio_mark_uptodate
> + * here. The read completion path (iomap_finish_folio_read or
> + * iomap_read_end) will call folio_end_read() which uses XOR
> + * semantics to set the uptodate bit. If we set it here, the XOR
> + * in folio_end_read() will clear it, leaving the folio not
> + * uptodate while the ifs says all blocks are uptodate.
> + */
> + if (uptodate && ifs->read_bytes_pending)
> + uptodate = false;
> spin_unlock_irqrestore(&ifs->state_lock, flags);
> }
>
> Thanks,
> Joanne
>
> >
> > So your iomap_set_range_uptodate patch can fix the above failed case since it blocks marking the folio's uptodate flag to 1.
> > Hope my findings are helpful.
> >
> > >
> > > --
> > > Thanks,
> > > Sasha
> > >
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2026-02-10 0:12 ` Wei Gao
@ 2026-02-10 0:20 ` Joanne Koong
2026-02-10 0:40 ` Wei Gao
0 siblings, 1 reply; 50+ messages in thread
From: Joanne Koong @ 2026-02-10 0:20 UTC (permalink / raw)
To: Wei Gao; +Cc: Sasha Levin, willy, linux-fsdevel, linux-kernel
On Mon, Feb 9, 2026 at 4:12 PM Wei Gao <wegao@suse.com> wrote:
>
> On Mon, Feb 09, 2026 at 11:08:50AM -0800, Joanne Koong wrote:
> > On Fri, Feb 6, 2026 at 11:16 PM Wei Gao <wegao@suse.com> wrote:
> > >
> > > On Tue, Dec 23, 2025 at 08:31:57PM -0500, Sasha Levin wrote:
> > > > On Tue, Dec 23, 2025 at 05:12:09PM -0800, Joanne Koong wrote:
> > > > > On Tue, Dec 23, 2025 at 2:30 PM Sasha Levin <sashal@kernel.org> wrote:
> > > > > >
> > > > >
> > > > > Hi Sasha,
> > > > >
> > > > > Thanks for your patch and for the detailed writeup.
> > > >
> > > > Thanks for looking into this!
> > > >
> > > > > > When iomap uses large folios, per-block uptodate tracking is managed via
> > > > > > iomap_folio_state (ifs). A race condition can cause the ifs uptodate bits
> > > > > > to become inconsistent with the folio's uptodate flag.
> > > > > >
> > > > > > The race occurs because folio_end_read() uses XOR semantics to atomically
> > > > > > set the uptodate bit and clear the locked bit:
> > > > > >
> > > > > > Thread A (read completion): Thread B (concurrent write):
> > > > > > -------------------------------- --------------------------------
> > > > > > iomap_finish_folio_read()
> > > > > > spin_lock(state_lock)
> > > > > > ifs_set_range_uptodate() -> true
> > > > > > spin_unlock(state_lock)
> > > > > > iomap_set_range_uptodate()
> > > > > > spin_lock(state_lock)
> > > > > > ifs_set_range_uptodate() -> true
> > > > > > spin_unlock(state_lock)
> > > > > > folio_mark_uptodate(folio)
> > > > > > folio_end_read(folio, true)
> > > > > > folio_xor_flags() // XOR CLEARS uptodate!
> > > > >
> > > > > The part I'm confused about here is how this can happen between a
> > > > > concurrent read and write. My understanding is that the folio is
> > > > > locked when the read occurs and locked when the write occurs and both
> > > > > locks get dropped only when the read or write finishes. Looking at
> > > > > iomap code, I see iomap_set_range_uptodate() getting called in
> > > > > __iomap_write_begin() and __iomap_write_end() for the writes, but in
> > > > > both those places the folio lock is held while this is called. I'm not
> > > > > seeing how the read and write race in the diagram can happen, but
> > > > > maybe I'm missing something here?
> > > >
> > > > Hmm, you're right... The folio lock should prevent concurrent read/write
> > > > access. Looking at this again, I suspect that FUSE was calling
> > > > folio_clear_uptodate() and folio_mark_uptodate() directly without updating the
> > > > ifs bits. For example, in fuse_send_write_pages() on write error, it calls
> > > > folio_clear_uptodate(folio) which clears the folio flag but leaves ifs still
> > > > showing all blocks uptodate?
> > >
> > > Hi Sasha
> > > On PowerPC with 64KB page size, msync04 fails with SIGBUS on NTFS-FUSE. The issue stems from a state inconsistency between
> > > the iomap_folio_state (ifs) bitmap and the folio's Uptodate flag.
> > > tst_test.c:1985: TINFO: === Testing on ntfs ===
> > > tst_test.c:1290: TINFO: Formatting /dev/loop0 with ntfs opts='' extra opts=''
> > > Failed to set locale, using default 'C'.
> > > The partition start sector was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
> > > The number of sectors per track was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
> > > The number of heads was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
> > > To boot from a device, Windows needs the 'partition start sector', the 'sectors per track' and the 'number of heads' to be set.
> > > Windows will not be able to boot from this device.
> > > tst_test.c:1302: TINFO: Mounting /dev/loop0 to /tmp/LTP_msy3ljVxi/msync04 fstyp=ntfs flags=0
> > > tst_test.c:1302: TINFO: Trying FUSE...
> > > tst_test.c:1953: TBROK: Test killed by SIGBUS!
> > >
> > > Root Cause Analysis: When a page fault triggers fuse_read_folio, iomap_read_folio_iter handles the request. For a 64KB page,
> > > after fetching 4KB via fuse_iomap_read_folio_range_async, the remaining 60KB (61440 bytes) is zero-filled via iomap_block_needs_zeroing
> > > and iomap_set_range_uptodate marks the folio Uptodate globally; after folio_xor_flags, the folio's uptodate flag becomes 0 again,
> > > finally triggering a SIGBUS in filemap_fault.
> >
> > Hi Wei,
> >
> > Thanks for your report. afaict, this scenario occurs only if the
> > server is a fuseblk server with a block size different from the memory
> > page size and if the file size is less than the size of the folio
> > being read in.
> Thanks for checking this and giving quick feedback :)
> >
> > Could you verify that this snippet from Sasha's patch fixes the issue?:
> Yes, Sasha's patch fixes the issue.
I think just those lines I pasted from Sasha's patch are the relevant
fix. Could you verify that just those lines (without the changes
from the rest of his patch) fix the issue?
Thanks,
Joanne
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index e5c1ca440d93..7ceda24cf6a7 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -87,12 +86,50 @@ static void iomap_set_range_uptodate(struct folio
> > *folio, size_t off,
> > if (ifs) {
> > spin_lock_irqsave(&ifs->state_lock, flags);
> > uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
> > + /*
> > + * If a read is in progress, we must NOT call folio_mark_uptodate
> > + * here. The read completion path (iomap_finish_folio_read or
> > + * iomap_read_end) will call folio_end_read() which uses XOR
> > + * semantics to set the uptodate bit. If we set it here, the XOR
> > + * in folio_end_read() will clear it, leaving the folio not
> > + * uptodate while the ifs says all blocks are uptodate.
> > + */
> > + if (uptodate && ifs->read_bytes_pending)
> > + uptodate = false;
> > spin_unlock_irqrestore(&ifs->state_lock, flags);
> > }
> >
> > Thanks,
> > Joanne
> >
> > >
> > > So your iomap_set_range_uptodate patch can fix the above failed case since it blocks marking the folio's uptodate flag to 1.
> > > Hope my findings are helpful.
> > >
> > > >
> > > > --
> > > > Thanks,
> > > > Sasha
> > > >
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2026-02-10 0:20 ` Joanne Koong
@ 2026-02-10 0:40 ` Wei Gao
2026-02-10 22:18 ` Joanne Koong
0 siblings, 1 reply; 50+ messages in thread
From: Wei Gao @ 2026-02-10 0:40 UTC (permalink / raw)
To: Joanne Koong; +Cc: Sasha Levin, willy, linux-fsdevel, linux-kernel
On Mon, Feb 09, 2026 at 04:20:01PM -0800, Joanne Koong wrote:
> On Mon, Feb 9, 2026 at 4:12 PM Wei Gao <wegao@suse.com> wrote:
> >
> > On Mon, Feb 09, 2026 at 11:08:50AM -0800, Joanne Koong wrote:
> > > On Fri, Feb 6, 2026 at 11:16 PM Wei Gao <wegao@suse.com> wrote:
> > > >
> > > > On Tue, Dec 23, 2025 at 08:31:57PM -0500, Sasha Levin wrote:
> > > > > On Tue, Dec 23, 2025 at 05:12:09PM -0800, Joanne Koong wrote:
> > > > > > On Tue, Dec 23, 2025 at 2:30 PM Sasha Levin <sashal@kernel.org> wrote:
> > > > > > >
> > > > > >
> > > > > > Hi Sasha,
> > > > > >
> > > > > > Thanks for your patch and for the detailed writeup.
> > > > >
> > > > > Thanks for looking into this!
> > > > >
> > > > > > > When iomap uses large folios, per-block uptodate tracking is managed via
> > > > > > > iomap_folio_state (ifs). A race condition can cause the ifs uptodate bits
> > > > > > > to become inconsistent with the folio's uptodate flag.
> > > > > > >
> > > > > > > The race occurs because folio_end_read() uses XOR semantics to atomically
> > > > > > > set the uptodate bit and clear the locked bit:
> > > > > > >
> > > > > > > Thread A (read completion): Thread B (concurrent write):
> > > > > > > -------------------------------- --------------------------------
> > > > > > > iomap_finish_folio_read()
> > > > > > > spin_lock(state_lock)
> > > > > > > ifs_set_range_uptodate() -> true
> > > > > > > spin_unlock(state_lock)
> > > > > > > iomap_set_range_uptodate()
> > > > > > > spin_lock(state_lock)
> > > > > > > ifs_set_range_uptodate() -> true
> > > > > > > spin_unlock(state_lock)
> > > > > > > folio_mark_uptodate(folio)
> > > > > > > folio_end_read(folio, true)
> > > > > > > folio_xor_flags() // XOR CLEARS uptodate!
> > > > > >
> > > > > > The part I'm confused about here is how this can happen between a
> > > > > > concurrent read and write. My understanding is that the folio is
> > > > > > locked when the read occurs and locked when the write occurs and both
> > > > > > locks get dropped only when the read or write finishes. Looking at
> > > > > > iomap code, I see iomap_set_range_uptodate() getting called in
> > > > > > __iomap_write_begin() and __iomap_write_end() for the writes, but in
> > > > > > both those places the folio lock is held while this is called. I'm not
> > > > > > seeing how the read and write race in the diagram can happen, but
> > > > > > maybe I'm missing something here?
> > > > >
> > > > > Hmm, you're right... The folio lock should prevent concurrent read/write
> > > > > access. Looking at this again, I suspect that FUSE was calling
> > > > > folio_clear_uptodate() and folio_mark_uptodate() directly without updating the
> > > > > ifs bits. For example, in fuse_send_write_pages() on write error, it calls
> > > > > folio_clear_uptodate(folio) which clears the folio flag but leaves ifs still
> > > > > showing all blocks uptodate?
> > > >
> > > > Hi Sasha
> > > > On PowerPC with 64KB page size, msync04 fails with SIGBUS on NTFS-FUSE. The issue stems from a state inconsistency between
> > > > the iomap_folio_state (ifs) bitmap and the folio's Uptodate flag.
> > > > tst_test.c:1985: TINFO: === Testing on ntfs ===
> > > > tst_test.c:1290: TINFO: Formatting /dev/loop0 with ntfs opts='' extra opts=''
> > > > Failed to set locale, using default 'C'.
> > > > The partition start sector was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
> > > > The number of sectors per track was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
> > > > The number of heads was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
> > > > To boot from a device, Windows needs the 'partition start sector', the 'sectors per track' and the 'number of heads' to be set.
> > > > Windows will not be able to boot from this device.
> > > > tst_test.c:1302: TINFO: Mounting /dev/loop0 to /tmp/LTP_msy3ljVxi/msync04 fstyp=ntfs flags=0
> > > > tst_test.c:1302: TINFO: Trying FUSE...
> > > > tst_test.c:1953: TBROK: Test killed by SIGBUS!
> > > >
> > > > Root Cause Analysis: When a page fault triggers fuse_read_folio, iomap_read_folio_iter handles the request. For a 64KB page,
> > > > after fetching 4KB via fuse_iomap_read_folio_range_async, the remaining 60KB (61440 bytes) is zero-filled via iomap_block_needs_zeroing
> > > > and iomap_set_range_uptodate marks the folio Uptodate globally; after folio_xor_flags, the folio's uptodate flag becomes 0 again,
> > > > finally triggering a SIGBUS in filemap_fault.
> > >
> > > Hi Wei,
> > >
> > > Thanks for your report. afaict, this scenario occurs only if the
> > > server is a fuseblk server with a block size different from the memory
> > > page size and if the file size is less than the size of the folio
> > > being read in.
> > Thanks for checking this and giving quick feedback :)
> > >
> > > Could you verify that this snippet from Sasha's patch fixes the issue?:
> > Yes, Sasha's patch fixes the issue.
>
> I think just those lines I pasted from Sasha's patch are the relevant
> fix. Could you verify that just those lines (without the changes
> from the rest of his patch) fix the issue?
Yes, just adding the two-line change in iomap_set_range_uptodate fixes
the issue.
+ if (uptodate && ifs->read_bytes_pending)
+ uptodate = false;
>
> Thanks,
> Joanne
>
>
> > >
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index e5c1ca440d93..7ceda24cf6a7 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -87,12 +86,50 @@ static void iomap_set_range_uptodate(struct folio
> > > *folio, size_t off,
> > > if (ifs) {
> > > spin_lock_irqsave(&ifs->state_lock, flags);
> > > uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
> > > + /*
> > > + * If a read is in progress, we must NOT call folio_mark_uptodate
> > > + * here. The read completion path (iomap_finish_folio_read or
> > > + * iomap_read_end) will call folio_end_read() which uses XOR
> > > + * semantics to set the uptodate bit. If we set it here, the XOR
> > > + * in folio_end_read() will clear it, leaving the folio not
> > > + * uptodate while the ifs says all blocks are uptodate.
> > > + */
> > > + if (uptodate && ifs->read_bytes_pending)
> > > + uptodate = false;
> > > spin_unlock_irqrestore(&ifs->state_lock, flags);
> > > }
> > >
> > > Thanks,
> > > Joanne
> > >
> > > >
> > > > So your iomap_set_range_uptodate patch can fix the above failed case since it blocks marking the folio's uptodate flag to 1.
> > > > Hope my findings are helpful.
> > > >
> > > > >
> > > > > --
> > > > > Thanks,
> > > > > Sasha
> > > > >
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2026-02-10 0:40 ` Wei Gao
@ 2026-02-10 22:18 ` Joanne Koong
2026-02-11 0:00 ` Sasha Levin
2026-02-11 3:11 ` Matthew Wilcox
0 siblings, 2 replies; 50+ messages in thread
From: Joanne Koong @ 2026-02-10 22:18 UTC (permalink / raw)
To: Wei Gao; +Cc: Sasha Levin, willy, linux-fsdevel, linux-kernel
On Mon, Feb 9, 2026 at 4:40 PM Wei Gao <wegao@suse.com> wrote:
>
> On Mon, Feb 09, 2026 at 04:20:01PM -0800, Joanne Koong wrote:
> > On Mon, Feb 9, 2026 at 4:12 PM Wei Gao <wegao@suse.com> wrote:
> > >
> > > On Mon, Feb 09, 2026 at 11:08:50AM -0800, Joanne Koong wrote:
> > > > On Fri, Feb 6, 2026 at 11:16 PM Wei Gao <wegao@suse.com> wrote:
> > > > >
> > > > > On Tue, Dec 23, 2025 at 08:31:57PM -0500, Sasha Levin wrote:
> > > > > > On Tue, Dec 23, 2025 at 05:12:09PM -0800, Joanne Koong wrote:
> > > > > > > On Tue, Dec 23, 2025 at 2:30 PM Sasha Levin <sashal@kernel.org> wrote:
> > > > > > > >
> > > > > > >
> > > > > > > Hi Sasha,
> > > > > > >
> > > > > > > Thanks for your patch and for the detailed writeup.
> > > > > >
> > > > > > Thanks for looking into this!
> > > > > >
> > > > > > > > When iomap uses large folios, per-block uptodate tracking is managed via
> > > > > > > > iomap_folio_state (ifs). A race condition can cause the ifs uptodate bits
> > > > > > > > to become inconsistent with the folio's uptodate flag.
> > > > > > > >
> > > > > > > > The race occurs because folio_end_read() uses XOR semantics to atomically
> > > > > > > > set the uptodate bit and clear the locked bit:
> > > > > > > >
> > > > > > > > Thread A (read completion): Thread B (concurrent write):
> > > > > > > > -------------------------------- --------------------------------
> > > > > > > > iomap_finish_folio_read()
> > > > > > > > spin_lock(state_lock)
> > > > > > > > ifs_set_range_uptodate() -> true
> > > > > > > > spin_unlock(state_lock)
> > > > > > > > iomap_set_range_uptodate()
> > > > > > > > spin_lock(state_lock)
> > > > > > > > ifs_set_range_uptodate() -> true
> > > > > > > > spin_unlock(state_lock)
> > > > > > > > folio_mark_uptodate(folio)
> > > > > > > > folio_end_read(folio, true)
> > > > > > > > folio_xor_flags() // XOR CLEARS uptodate!
> > > > > > >
> > > > > > > The part I'm confused about here is how this can happen between a
> > > > > > > concurrent read and write. My understanding is that the folio is
> > > > > > > locked when the read occurs and locked when the write occurs and both
> > > > > > > locks get dropped only when the read or write finishes. Looking at
> > > > > > > iomap code, I see iomap_set_range_uptodate() getting called in
> > > > > > > __iomap_write_begin() and __iomap_write_end() for the writes, but in
> > > > > > > both those places the folio lock is held while this is called. I'm not
> > > > > > > seeing how the read and write race in the diagram can happen, but
> > > > > > > maybe I'm missing something here?
> > > > > >
> > > > > > Hmm, you're right... The folio lock should prevent concurrent read/write
> > > > > > access. Looking at this again, I suspect that FUSE was calling
> > > > > > folio_clear_uptodate() and folio_mark_uptodate() directly without updating the
> > > > > > ifs bits. For example, in fuse_send_write_pages() on write error, it calls
> > > > > > folio_clear_uptodate(folio) which clears the folio flag but leaves ifs still
> > > > > > showing all blocks uptodate?
> > > > >
> > > > > Hi Sasha
> > > > > On PowerPC with 64KB page size, msync04 fails with SIGBUS on NTFS-FUSE. The issue stems from a state inconsistency between
> > > > > the iomap_folio_state (ifs) bitmap and the folio's Uptodate flag.
> > > > > tst_test.c:1985: TINFO: === Testing on ntfs ===
> > > > > tst_test.c:1290: TINFO: Formatting /dev/loop0 with ntfs opts='' extra opts=''
> > > > > Failed to set locale, using default 'C'.
> > > > > The partition start sector was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
> > > > > The number of sectors per track was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
> > > > > The number of heads was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
> > > > > To boot from a device, Windows needs the 'partition start sector', the 'sectors per track' and the 'number of heads' to be set.
> > > > > Windows will not be able to boot from this device.
> > > > > tst_test.c:1302: TINFO: Mounting /dev/loop0 to /tmp/LTP_msy3ljVxi/msync04 fstyp=ntfs flags=0
> > > > > tst_test.c:1302: TINFO: Trying FUSE...
> > > > > tst_test.c:1953: TBROK: Test killed by SIGBUS!
> > > > >
> > > > > Root Cause Analysis: When a page fault triggers fuse_read_folio, iomap_read_folio_iter handles the request. For a 64KB page,
> > > > > after fetching 4KB via fuse_iomap_read_folio_range_async, the remaining 60KB (61440 bytes) is zero-filled via iomap_block_needs_zeroing
> > > > > and iomap_set_range_uptodate marks the folio Uptodate globally; after folio_xor_flags, the folio's uptodate flag becomes 0 again,
> > > > > finally triggering a SIGBUS in filemap_fault.
> > > >
> > > > Hi Wei,
> > > >
> > > > Thanks for your report. afaict, this scenario occurs only if the
> > > > server is a fuseblk server with a block size different from the memory
> > > > page size and if the file size is less than the size of the folio
> > > > being read in.
> > > Thanks for checking this and giving quick feedback :)
> > > >
> > > > Could you verify that this snippet from Sasha's patch fixes the issue?:
> > > Yes, Sasha's patch fixes the issue.
> >
> > I think just those lines I pasted from Sasha's patch are the relevant
> > fix. Could you verify that just those lines (without the changes
> > from the rest of his patch) fix the issue?
> Yes, just adding the two-line change in iomap_set_range_uptodate fixes
> the issue.
Great, thank you for confirming.
Sasha, would you mind submitting this snippet of your patch as the fix
for the EOF zeroing issue? I think it could be restructured to
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 1fe19b4ee2f4..412e661871f8 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -87,7 +87,16 @@ static void iomap_set_range_uptodate(struct folio
*folio, size_t off,
if (ifs) {
spin_lock_irqsave(&ifs->state_lock, flags);
- uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
+ /*
+ * If a read is in progress, we must NOT call
folio_mark_uptodate.
+ * The read completion path (iomap_finish_folio_read or
+ * iomap_read_end) will call folio_end_read() which uses XOR
+ * semantics to set the uptodate bit. If we set it here, the XOR
+ * in folio_end_read() will clear it, leaving the folio not
+ * uptodate.
+ */
+ uptodate = ifs_set_range_uptodate(folio, ifs, off, len) &&
+ !ifs->read_bytes_pending;
spin_unlock_irqrestore(&ifs->state_lock, flags);
}
to be a bit more concise.
If you're busy and don't have the bandwidth, I'm happy to forward the
patch on your behalf with your Signed-off-by / authorship.
Thanks,
Joanne
> + if (uptodate && ifs->read_bytes_pending)
> + uptodate = false;
> >
> > Thanks,
> > Joanne
> >
> >
> > > >
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index e5c1ca440d93..7ceda24cf6a7 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -87,12 +86,50 @@ static void iomap_set_range_uptodate(struct folio
> > > > *folio, size_t off,
> > > > if (ifs) {
> > > > spin_lock_irqsave(&ifs->state_lock, flags);
> > > > uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
> > > > + /*
> > > > + * If a read is in progress, we must NOT call folio_mark_uptodate
> > > > + * here. The read completion path (iomap_finish_folio_read or
> > > > + * iomap_read_end) will call folio_end_read() which uses XOR
> > > > + * semantics to set the uptodate bit. If we set it here, the XOR
> > > > + * in folio_end_read() will clear it, leaving the folio not
> > > > + * uptodate while the ifs says all blocks are uptodate.
> > > > + */
> > > > + if (uptodate && ifs->read_bytes_pending)
> > > > + uptodate = false;
> > > > spin_unlock_irqrestore(&ifs->state_lock, flags);
> > > > }
> > > >
> > > > Thanks,
> > > > Joanne
> > > >
> > > > >
> > > > > So your iomap_set_range_uptodate patch can fix the above failed case since it blocks marking the folio's uptodate flag to 1.
> > > > > Hope my findings are helpful.
> > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks,
> > > > > > Sasha
> > > > > >
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2026-02-10 22:18 ` Joanne Koong
@ 2026-02-11 0:00 ` Sasha Levin
2026-02-11 3:11 ` Matthew Wilcox
1 sibling, 0 replies; 50+ messages in thread
From: Sasha Levin @ 2026-02-11 0:00 UTC (permalink / raw)
To: Joanne Koong; +Cc: Wei Gao, willy, linux-fsdevel, linux-kernel
On Tue, Feb 10, 2026 at 02:18:06PM -0800, Joanne Koong wrote:
>On Mon, Feb 9, 2026 at 4:40 PM Wei Gao <wegao@suse.com> wrote:
>>
>> On Mon, Feb 09, 2026 at 04:20:01PM -0800, Joanne Koong wrote:
>> > On Mon, Feb 9, 2026 at 4:12 PM Wei Gao <wegao@suse.com> wrote:
>> > >
>> > > On Mon, Feb 09, 2026 at 11:08:50AM -0800, Joanne Koong wrote:
>> > > > On Fri, Feb 6, 2026 at 11:16 PM Wei Gao <wegao@suse.com> wrote:
>> > > > >
>> > > > > On Tue, Dec 23, 2025 at 08:31:57PM -0500, Sasha Levin wrote:
>> > > > > > On Tue, Dec 23, 2025 at 05:12:09PM -0800, Joanne Koong wrote:
>> > > > > > > On Tue, Dec 23, 2025 at 2:30 PM Sasha Levin <sashal@kernel.org> wrote:
>> > > > > > > >
>> > > > > > >
>> > > > > > > Hi Sasha,
>> > > > > > >
>> > > > > > > Thanks for your patch and for the detailed writeup.
>> > > > > >
>> > > > > > Thanks for looking into this!
>> > > > > >
>> > > > > > > > When iomap uses large folios, per-block uptodate tracking is managed via
>> > > > > > > > iomap_folio_state (ifs). A race condition can cause the ifs uptodate bits
>> > > > > > > > to become inconsistent with the folio's uptodate flag.
>> > > > > > > >
>> > > > > > > > The race occurs because folio_end_read() uses XOR semantics to atomically
>> > > > > > > > set the uptodate bit and clear the locked bit:
>> > > > > > > >
>> > > > > > > > Thread A (read completion): Thread B (concurrent write):
>> > > > > > > > -------------------------------- --------------------------------
>> > > > > > > > iomap_finish_folio_read()
>> > > > > > > > spin_lock(state_lock)
>> > > > > > > > ifs_set_range_uptodate() -> true
>> > > > > > > > spin_unlock(state_lock)
>> > > > > > > > iomap_set_range_uptodate()
>> > > > > > > > spin_lock(state_lock)
>> > > > > > > > ifs_set_range_uptodate() -> true
>> > > > > > > > spin_unlock(state_lock)
>> > > > > > > > folio_mark_uptodate(folio)
>> > > > > > > > folio_end_read(folio, true)
>> > > > > > > > folio_xor_flags() // XOR CLEARS uptodate!
>> > > > > > >
>> > > > > > > The part I'm confused about here is how this can happen between a
>> > > > > > > concurrent read and write. My understanding is that the folio is
>> > > > > > > locked when the read occurs and locked when the write occurs and both
>> > > > > > > locks get dropped only when the read or write finishes. Looking at
>> > > > > > > iomap code, I see iomap_set_range_uptodate() getting called in
>> > > > > > > __iomap_write_begin() and __iomap_write_end() for the writes, but in
>> > > > > > > both those places the folio lock is held while this is called. I'm not
>> > > > > > > seeing how the read and write race in the diagram can happen, but
>> > > > > > > maybe I'm missing something here?
>> > > > > >
>> > > > > > Hmm, you're right... The folio lock should prevent concurrent read/write
>> > > > > > access. Looking at this again, I suspect that FUSE was calling
>> > > > > > folio_clear_uptodate() and folio_mark_uptodate() directly without updating the
>> > > > > > ifs bits. For example, in fuse_send_write_pages() on write error, it calls
>> > > > > > folio_clear_uptodate(folio) which clears the folio flag but leaves ifs still
>> > > > > > showing all blocks uptodate?
>> > > > >
>> > > > > Hi Sasha
>> > > > > On PowerPC with 64KB page size, msync04 fails with SIGBUS on NTFS-FUSE. The issue stems from a state inconsistency between
>> > > > > the iomap_folio_state (ifs) bitmap and the folio's Uptodate flag.
>> > > > > tst_test.c:1985: TINFO: === Testing on ntfs ===
>> > > > > tst_test.c:1290: TINFO: Formatting /dev/loop0 with ntfs opts='' extra opts=''
>> > > > > Failed to set locale, using default 'C'.
>> > > > > The partition start sector was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
>> > > > > The number of sectors per track was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
>> > > > > The number of heads was not specified for /dev/loop0 and it could not be obtained automatically. It has been set to 0.
>> > > > > To boot from a device, Windows needs the 'partition start sector', the 'sectors per track' and the 'number of heads' to be set.
>> > > > > Windows will not be able to boot from this device.
>> > > > > tst_test.c:1302: TINFO: Mounting /dev/loop0 to /tmp/LTP_msy3ljVxi/msync04 fstyp=ntfs flags=0
>> > > > > tst_test.c:1302: TINFO: Trying FUSE...
>> > > > > tst_test.c:1953: TBROK: Test killed by SIGBUS!
>> > > > >
>> > > > > Root Cause Analysis: When a page fault triggers fuse_read_folio, the iomap_read_folio_iter handles the request. For a 64KB page,
>> > > > > after fetching 4KB via fuse_iomap_read_folio_range_async, the remaining 60KB (61440 bytes) is zero-filled via iomap_block_needs_zeroing,
>> > > > > then iomap_set_range_uptodate marks the folio as Uptodate globally; after folio_xor_flags the folio's uptodate flag becomes 0 again, which finally triggers
>> > > > > a SIGBUS in filemap_fault.
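The folio_xor_flags() behaviour described above can be sketched in plain userspace C (a hypothetical model: the PG_* values and the helper are illustrative stand-ins, not the kernel's definitions):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of folio_end_read()'s XOR trick (simplified and
 * hypothetical): a single XOR clears the lock bit and sets the
 * uptodate bit. If uptodate was already set beforehand, the same
 * XOR clears it again -- the failure described above.
 */
#define PG_locked	(1u << 0)
#define PG_uptodate	(1u << 1)

static unsigned int model_end_read(unsigned int flags, bool success)
{
	unsigned int mask = PG_locked;

	if (success)
		mask |= PG_uptodate;
	return flags ^ mask;	/* like folio_xor_flags() */
}
```

With a plainly locked, not-uptodate folio the XOR unlocks and marks it uptodate; if something already set uptodate, the XOR clears it while leaving the folio unlocked.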
>> > > >
>> > > > Hi Wei,
>> > > >
>> > > > Thanks for your report. afaict, this scenario occurs only if the
>> > > > server is a fuseblk server with a block size different from the memory
>> > > > page size and if the file size is less than the size of the folio
>> > > > being read in.
>> > > Thanks for checking this and give quick feedback :)
>> > > >
>> > > > Could you verify that this snippet from Sasha's patch fixes the issue?:
>> > > Yes, Sasha's patch can fix the issue.
>> >
>> > I think just those lines I pasted from Sasha's patch are the relevant
>> > fix. Could you verify that just those lines (without the changes
>> > from the rest of his patch) fix the issue?
>> Yes, I just added the two-line change in iomap_set_range_uptodate and
>> it fixes the issue.
>
>Great, thank you for confirming.
>
>Sasha, would you mind submitting this snippet of your patch as the fix
>for the EOF zeroing issue? I think it could be restructured to
>
>diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>index 1fe19b4ee2f4..412e661871f8 100644
>--- a/fs/iomap/buffered-io.c
>+++ b/fs/iomap/buffered-io.c
>@@ -87,7 +87,16 @@ static void iomap_set_range_uptodate(struct folio
>*folio, size_t off,
>
> if (ifs) {
> spin_lock_irqsave(&ifs->state_lock, flags);
>- uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
>+ /*
>+ * If a read is in progress, we must NOT call
>folio_mark_uptodate.
>+ * The read completion path (iomap_finish_folio_read or
>+ * iomap_read_end) will call folio_end_read() which uses XOR
>+ * semantics to set the uptodate bit. If we set it here, the XOR
>+ * in folio_end_read() will clear it, leaving the folio not
>+ * uptodate.
>+ */
>+ uptodate = ifs_set_range_uptodate(folio, ifs, off, len) &&
>+ !ifs->read_bytes_pending;
> spin_unlock_irqrestore(&ifs->state_lock, flags);
> }
>
>to be a bit more concise.
>
>If you're busy and don't have the bandwidth, I'm happy to forward the
>patch on your behalf with your Signed-off-by / authorship.
Thanks for the offer Joanne!
Since you've done all the triaging work here, please go ahead and submit it -
something like a Suggested-by would be more than enough for me :)
--
Thanks,
Sasha
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2026-02-10 22:18 ` Joanne Koong
2026-02-11 0:00 ` Sasha Levin
@ 2026-02-11 3:11 ` Matthew Wilcox
2026-02-11 19:33 ` Joanne Koong
1 sibling, 1 reply; 50+ messages in thread
From: Matthew Wilcox @ 2026-02-11 3:11 UTC (permalink / raw)
To: Joanne Koong; +Cc: Wei Gao, Sasha Levin, linux-fsdevel, linux-kernel
On Tue, Feb 10, 2026 at 02:18:06PM -0800, Joanne Koong wrote:
> spin_lock_irqsave(&ifs->state_lock, flags);
> - uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
> + /*
> + * If a read is in progress, we must NOT call
> folio_mark_uptodate.
> + * The read completion path (iomap_finish_folio_read or
> + * iomap_read_end) will call folio_end_read() which uses XOR
> + * semantics to set the uptodate bit. If we set it here, the XOR
> + * in folio_end_read() will clear it, leaving the folio not
> + * uptodate.
> + */
> + uptodate = ifs_set_range_uptodate(folio, ifs, off, len) &&
> + !ifs->read_bytes_pending;
> spin_unlock_irqrestore(&ifs->state_lock, flags);
This can't possibly be the right fix. There's some horrible confusion
here. It should not be possible to have read bytes pending _and_ the
entire folio be uptodate. That's an invariant that should always be
maintained.
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2026-02-11 3:11 ` Matthew Wilcox
@ 2026-02-11 19:33 ` Joanne Koong
2026-02-11 21:03 ` Matthew Wilcox
0 siblings, 1 reply; 50+ messages in thread
From: Joanne Koong @ 2026-02-11 19:33 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Wei Gao, Sasha Levin, linux-fsdevel, linux-kernel
On Tue, Feb 10, 2026 at 7:11 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Feb 10, 2026 at 02:18:06PM -0800, Joanne Koong wrote:
> > spin_lock_irqsave(&ifs->state_lock, flags);
> > - uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
> > + /*
> > + * If a read is in progress, we must NOT call
> > folio_mark_uptodate.
> > + * The read completion path (iomap_finish_folio_read or
> > + * iomap_read_end) will call folio_end_read() which uses XOR
> > + * semantics to set the uptodate bit. If we set it here, the XOR
> > + * in folio_end_read() will clear it, leaving the folio not
> > + * uptodate.
> > + */
> > + uptodate = ifs_set_range_uptodate(folio, ifs, off, len) &&
> > + !ifs->read_bytes_pending;
> > spin_unlock_irqrestore(&ifs->state_lock, flags);
>
> This can't possibly be the right fix. There's some horrible confusion
> here. It should not be possible to have read bytes pending _and_ the
> entire folio be uptodate. That's an invariant that should always be
> maintained.
ifs->read_bytes_pending gets initialized to the folio size, but if the
file being read in is smaller than the size of the folio, then we
reach this scenario because the file has been read in but
ifs->read_bytes_pending is still a positive value because it
represents the bytes between the end of the file and the end of the
folio. If the folio size is 16k and the file size is 4k:
a) ifs->read_bytes_pending gets initialized to 16k
b) ->read_folio_range() is called for the 4k read
c) the 4k read succeeds, ifs->read_bytes_pending is now 12k and the
0 to 4k range is marked uptodate
d) the post-eof blocks are zeroed and marked uptodate in the call to
iomap_set_range_uptodate()
e) iomap_set_range_uptodate() sees all the ranges are marked
uptodate and it marks the folio uptodate
f) iomap_read_end() gets called to subtract the 12k from
ifs->read_bytes_pending. it too sees all the ranges are marked
uptodate and marks the folio uptodate
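The a)-f) sequence above can be replayed in a small userspace model (hypothetical: the struct and helpers are illustrative stand-ins for iomap_folio_state and the iomap read paths, not the real API):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the 16k-folio / 4k-file sequence above. */
struct model_ifs {
	unsigned int read_bytes_pending;
	unsigned int uptodate_bytes;	/* stands in for the per-block bitmap */
	unsigned int folio_size;
	bool folio_uptodate;		/* the folio's uptodate flag */
};

/* like iomap_set_range_uptodate(): mark blocks, maybe the folio flag */
static void model_set_range_uptodate(struct model_ifs *ifs, unsigned int len)
{
	ifs->uptodate_bytes += len;
	if (ifs->uptodate_bytes == ifs->folio_size)
		ifs->folio_uptodate = true;	/* step (e): set too early */
}

/* like iomap_read_end()/folio_end_read(): the final XOR toggles the flag */
static void model_read_end(struct model_ifs *ifs, unsigned int len)
{
	ifs->read_bytes_pending -= len;
	if (ifs->read_bytes_pending == 0 &&
	    ifs->uptodate_bytes == ifs->folio_size)
		ifs->folio_uptodate = !ifs->folio_uptodate;	/* step (f) */
}

static bool model_run_scenario(void)
{
	struct model_ifs ifs = { .read_bytes_pending = 16384,
				 .folio_size = 16384 };

	model_read_end(&ifs, 4096);		/* (c) the 4k read completes */
	model_set_range_uptodate(&ifs, 4096);	/* (c) 0-4k marked uptodate */
	model_set_range_uptodate(&ifs, 12288);	/* (d)/(e) post-EOF zeroing */
	model_read_end(&ifs, 12288);		/* (f) XOR clears the flag */
	return ifs.folio_uptodate;
}
```

The scenario ends with every block uptodate but the folio flag clear, which is the inconsistent state that later faults with SIGBUS.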
The same scenario could happen for IOMAP_INLINE mappings if part of
the folio is read in through ->read_folio_range() and then the rest is
read in as inline data.
An alternative solution is to not have zeroed-out / inlined mappings
call iomap_read_end(), eg something like this [1], but this adds
additional complexity and doesn't work if there's additional mappings
for the folio after a non-IOMAP_MAPPED mapping.
Is there a better approach that I'm missing?
Thanks,
Joanne
[1] https://github.com/joannekoong/linux/commit/de48d3c29db8ae654300341e3eec12497df54673
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2026-02-11 19:33 ` Joanne Koong
@ 2026-02-11 21:03 ` Matthew Wilcox
2026-02-11 23:13 ` Joanne Koong
0 siblings, 1 reply; 50+ messages in thread
From: Matthew Wilcox @ 2026-02-11 21:03 UTC (permalink / raw)
To: Joanne Koong; +Cc: Wei Gao, Sasha Levin, linux-fsdevel, linux-kernel
On Wed, Feb 11, 2026 at 11:33:05AM -0800, Joanne Koong wrote:
> ifs->read_bytes_pending gets initialized to the folio size, but if the
> file being read in is smaller than the size of the folio, then we
> reach this scenario because the file has been read in but
> ifs->read_bytes_pending is still a positive value because it
> represents the bytes between the end of the file and the end of the
> folio. If the folio size is 16k and the file size is 4k:
> a) ifs->read_bytes_pending gets initialized to 16k
> b) ->read_folio_range() is called for the 4k read
> c) the 4k read succeeds, ifs->read_bytes_pending is now 12k and the
> 0 to 4k range is marked uptodate
> d) the post-eof blocks are zeroed and marked uptodate in the call to
> iomap_set_range_uptodate()
This is the bug then. If they're marked uptodate, read_bytes_pending
should be decremented at the same time. Now, I appreciate that
iomap_set_range_uptodate() is called both from iomap_read_folio_iter()
and __iomap_write_begin(), and it can't decrement read_bytes_pending
in the latter case. Perhaps a flag or a second length parameter is
the solution?
> e) iomap_set_range_uptodate() sees all the ranges are marked
> uptodate and it marks the folio uptodate
> f) iomap_read_end() gets called to subtract the 12k from
> ifs->read_bytes_pending. it too sees all the ranges are marked
> uptodate and marks the folio uptodate
>
> The same scenario could happen for IOMAP_INLINE mappings if part of
> the folio is read in through ->read_folio_range() and then the rest is
> read in as inline data.
This is basically the same case as post-eof.
> An alternative solution is to not have zeroed-out / inlined mappings
> call iomap_read_end(), eg something like this [1], but this adds
> additional complexity and doesn't work if there's additional mappings
> for the folio after a non-IOMAP_MAPPED mapping.
>
> Is there a better approach that I'm missing?
>
> Thanks,
> Joanne
>
> [1] https://github.com/joannekoong/linux/commit/de48d3c29db8ae654300341e3eec12497df54673
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2026-02-11 21:03 ` Matthew Wilcox
@ 2026-02-11 23:13 ` Joanne Koong
2026-02-12 19:31 ` Matthew Wilcox
0 siblings, 1 reply; 50+ messages in thread
From: Joanne Koong @ 2026-02-11 23:13 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Wei Gao, Sasha Levin, linux-fsdevel, linux-kernel
On Wed, Feb 11, 2026 at 1:03 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Feb 11, 2026 at 11:33:05AM -0800, Joanne Koong wrote:
> > ifs->read_bytes_pending gets initialized to the folio size, but if the
> > file being read in is smaller than the size of the folio, then we
> > reach this scenario because the file has been read in but
> > ifs->read_bytes_pending is still a positive value because it
> > represents the bytes between the end of the file and the end of the
> > folio. If the folio size is 16k and the file size is 4k:
> > a) ifs->read_bytes_pending gets initialized to 16k
> > b) ->read_folio_range() is called for the 4k read
> > c) the 4k read succeeds, ifs->read_bytes_pending is now 12k and the
> > 0 to 4k range is marked uptodate
> > d) the post-eof blocks are zeroed and marked uptodate in the call to
> > iomap_set_range_uptodate()
>
> This is the bug then. If they're marked uptodate, read_bytes_pending
> should be decremented at the same time. Now, I appreciate that
> iomap_set_range_uptodate() is called both from iomap_read_folio_iter()
> and __iomap_write_begin(), and it can't decrement read_bytes_pending
> in the latter case. Perhaps a flag or a second length parameter is
> the solution?
I don't think it's enough to decrement read_bytes_pending by the
zeroed/read-inline length because there's these two edge cases:
a) some blocks in the folio were already uptodate from the very
beginning and skipped for IO but not decremented yet from
ifs->read_bytes_pending, which means in iomap_read_end(),
ifs->read_bytes_pending would be > 0 and the uptodate flag could get
XORed again. This means we need to also decrement read_bytes_pending
by bytes_submitted as well for this case
b) the async ->read_folio_range() callback finishes after the
zeroing's read_bytes_pending decrement and calls folio_end_read(), so
we need to assign ctx->cur_folio to NULL
I think the code would have to look something like [1] (this is
similar to the alternative approach I mentioned in my previous reply
but fixed up to cover some more edge cases).
Thanks,
Joanne
[1] https://github.com/joannekoong/linux/commit/b42f47726433a8130e8c27d1b43b16e27dfd6960
>
> > e) iomap_set_range_uptodate() sees all the ranges are marked
> > uptodate and it marks the folio uptodate
> > f) iomap_read_end() gets called to subtract the 12k from
> > ifs->read_bytes_pending. it too sees all the ranges are marked
> > uptodate and marks the folio uptodate
> >
> > The same scenario could happen for IOMAP_INLINE mappings if part of
> > the folio is read in through ->read_folio_range() and then the rest is
> > read in as inline data.
>
> This is basically the same case as post-eof.
>
> > An alternative solution is to not have zeroed-out / inlined mappings
> > call iomap_read_end(), eg something like this [1], but this adds
> > additional complexity and doesn't work if there's additional mappings
> > for the folio after a non-IOMAP_MAPPED mapping.
(I was wrong about it not working for cases where there's additional
mappings after a non-IOMAP_MAPPED mapping, since both
inline-read/zeroing are no-ops if the entire folio is already
uptodate)
> >
> > Is there a better approach that I'm missing?
> >
> > Thanks,
> > Joanne
> >
> > [1] https://github.com/joannekoong/linux/commit/de48d3c29db8ae654300341e3eec12497df54673
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2026-02-11 23:13 ` Joanne Koong
@ 2026-02-12 19:31 ` Matthew Wilcox
2026-02-13 0:53 ` Joanne Koong
0 siblings, 1 reply; 50+ messages in thread
From: Matthew Wilcox @ 2026-02-12 19:31 UTC (permalink / raw)
To: Joanne Koong; +Cc: Wei Gao, Sasha Levin, linux-fsdevel, linux-kernel
On Wed, Feb 11, 2026 at 03:13:48PM -0800, Joanne Koong wrote:
> On Wed, Feb 11, 2026 at 1:03 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, Feb 11, 2026 at 11:33:05AM -0800, Joanne Koong wrote:
> > > ifs->read_bytes_pending gets initialized to the folio size, but if the
> > > file being read in is smaller than the size of the folio, then we
> > > reach this scenario because the file has been read in but
> > > ifs->read_bytes_pending is still a positive value because it
> > > represents the bytes between the end of the file and the end of the
> > > folio. If the folio size is 16k and the file size is 4k:
> > > a) ifs->read_bytes_pending gets initialized to 16k
> > > b) ->read_folio_range() is called for the 4k read
> > > c) the 4k read succeeds, ifs->read_bytes_pending is now 12k and the
> > > 0 to 4k range is marked uptodate
> > > d) the post-eof blocks are zeroed and marked uptodate in the call to
> > > iomap_set_range_uptodate()
> >
> > This is the bug then. If they're marked uptodate, read_bytes_pending
> > should be decremented at the same time. Now, I appreciate that
> > iomap_set_range_uptodate() is called both from iomap_read_folio_iter()
> > and __iomap_write_begin(), and it can't decrement read_bytes_pending
> > in the latter case. Perhaps a flag or a second length parameter is
> > the solution?
>
> I don't think it's enough to decrement read_bytes_pending by the
> zeroed/read-inline length because there's these two edge cases:
> a) some blocks in the folio were already uptodate from the very
> beginning and skipped for IO but not decremented yet from
> ifs->read_bytes_pending, which means in iomap_read_end(),
> ifs->read_bytes_pending would be > 0 and the uptodate flag could get
> XORed again. This means we need to also decrement read_bytes_pending
> by bytes_submitted as well for this case
Hm, that's a good one. It can't happen for readahead, but it can happen
if we start out by writing to some blocks of a folio, then call
read_folio to get the remaining blocks uptodate. We could avoid it
happening by initialising read_bytes_pending to folio_size() -
bitmap_weight(ifs->uptodate) * block_size.
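That initialisation can be sketched as a pure function (a hypothetical model; `uptodate_blocks` stands in for `bitmap_weight(ifs->uptodate)`):

```c
#include <assert.h>

/*
 * Hypothetical model of the suggested initialisation: seed
 * read_bytes_pending with only the bytes that still need I/O,
 * i.e. folio_size() - bitmap_weight(ifs->uptodate) * block_size.
 */
static unsigned int model_read_init(unsigned int folio_size,
				    unsigned int block_size,
				    unsigned int uptodate_blocks)
{
	return folio_size - uptodate_blocks * block_size;
}
```

Blocks already uptodate from an earlier write are then never counted as pending, so they cannot leave a stale remainder at read end.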
> b) the async ->read_folio_range() callback finishes after the
> zeroing's read_bytes_pending decrement and calls folio_end_read(), so
> we need to assign ctx->cur_folio to NULL
If we return 'finished' from iomap_finish_folio_read(), we can handle
this?
> I think the code would have to look something like [1] (this is
> similar to the alternative approach I mentioned in my previous reply
> but fixed up to cover some more edge cases).
>
> Thanks,
> Joanne
>
> [1] https://github.com/joannekoong/linux/commit/b42f47726433a8130e8c27d1b43b16e27dfd6960
I think we can do everything we need with a suitably modified
iomap_finish_folio_read() rather than the new iomap_finish_read_range().
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
2026-02-12 19:31 ` Matthew Wilcox
@ 2026-02-13 0:53 ` Joanne Koong
0 siblings, 0 replies; 50+ messages in thread
From: Joanne Koong @ 2026-02-13 0:53 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Wei Gao, Sasha Levin, linux-fsdevel, linux-kernel
On Thu, Feb 12, 2026 at 11:31 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Feb 11, 2026 at 03:13:48PM -0800, Joanne Koong wrote:
> > On Wed, Feb 11, 2026 at 1:03 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Wed, Feb 11, 2026 at 11:33:05AM -0800, Joanne Koong wrote:
> > > > ifs->read_bytes_pending gets initialized to the folio size, but if the
> > > > file being read in is smaller than the size of the folio, then we
> > > > reach this scenario because the file has been read in but
> > > > ifs->read_bytes_pending is still a positive value because it
> > > > represents the bytes between the end of the file and the end of the
> > > > folio. If the folio size is 16k and the file size is 4k:
> > > > a) ifs->read_bytes_pending gets initialized to 16k
> > > > b) ->read_folio_range() is called for the 4k read
> > > > c) the 4k read succeeds, ifs->read_bytes_pending is now 12k and the
> > > > 0 to 4k range is marked uptodate
> > > > d) the post-eof blocks are zeroed and marked uptodate in the call to
> > > > iomap_set_range_uptodate()
> > >
> > > This is the bug then. If they're marked uptodate, read_bytes_pending
> > > should be decremented at the same time. Now, I appreciate that
> > > iomap_set_range_uptodate() is called both from iomap_read_folio_iter()
> > > and __iomap_write_begin(), and it can't decrement read_bytes_pending
> > > in the latter case. Perhaps a flag or a second length parameter is
> > > the solution?
> >
> > I don't think it's enough to decrement read_bytes_pending by the
> > zeroed/read-inline length because there's these two edge cases:
> > a) some blocks in the folio were already uptodate from the very
> > beginning and skipped for IO but not decremented yet from
> > ifs->read_bytes_pending, which means in iomap_read_end(),
> > ifs->read_bytes_pending would be > 0 and the uptodate flag could get
> > XORed again. This means we need to also decrement read_bytes_pending
> > by bytes_submitted as well for this case
>
> Hm, that's a good one. It can't happen for readahead, but it can happen
> if we start out by writing to some blocks of a folio, then call
> read_folio to get the remaining blocks uptodate. We could avoid it
> happening by initialising read_bytes_pending to folio_size() -
> bitmap_weight(ifs->uptodate) * block_size.
This is an interesting idea but if we do this then I think this adds
some more edge cases. For example, the range being inlined or zeroed
may have some already uptodate blocks (eg from a prior buffered write)
so we'll need to calculate how many already-existing uptodate bytes
there are in that range to avoid over-decrementing
ifs->read_bytes_pending. I think we would also have to move the
ifs_alloc() and iomap_read_init() calls to the very beginning of
iomap_read_folio_iter() before any iomap_read_inline_data() call
because there could be the case where a folio has an ifs that was
allocated from a prior write, so if we call iomap_finish_folio_read()
after iomap_read_inline_data(), the folio's ifs->read_bytes_pending
now must be initialized before the inline read. Whereas before, we had
some more optimal behavior with being able to entirely skip the ifs
allocation and read initialization if the entire folio gets read
inline.
>
> > b) the async ->read_folio_range() callback finishes after the
> > zeroing's read_bytes_pending decrement and calls folio_end_read(), so
> > we need to assign ctx->cur_folio to NULL
>
> If we return 'finished' from iomap_finish_folio_read(), we can handle
> this?
I think there is still this scenario:
- ->read_folio gets called on an 8k-size folio for a 4k-size file
- iomap_read_init() is called, ifs->read_bytes_pending is now 8k
- make async ->read_folio_range() call to read in 4k
- iomap zeroes out folio from 4k to 8k, then calls
iomap_finish_folio_read() with off = 4k and len = 4k
- in iomap_finish_folio_read(), decrement ifs->read_bytes_pending by
len. ifs->read_bytes_pending is now 4k
- async ->read_folio_range() completes read, calls
iomap_finish_folio_read() with off=0 and len = 4k, which now
decrements ifs->read_bytes_pending by 4k. read_bytes_pending is now 0,
so folio_end_read() gets called. folio should now not be touched by
iomap
- iomap still has valid ctx->cur_folio, and calls iomap_read_end on
ctx->cur_folio
This is the same issue as the one in
https://lore.kernel.org/linux-fsdevel/20260126224107.2182262-2-joannelkoong@gmail.com/
We could always set ctx->cur_folio to NULL after inline/zeroing calls
iomap_finish_folio_read() regardless of whether it actually ended the
read or not, but then this runs into issues for zeroing. The zeroing
can be triggered by non-EOF cases, eg if the first mapping is an
IOMAP_HOLE and then the rest of the folio is mapped. We may still need
to read in the rest of the folio, so we can't just set ctx->cur_folio
to NULL. I guess one workaround is to explicitly check if the zeroing
is for IOMAP_MAPPED types and if so then always set ctx->cur_folio to
NULL, but I think this just gets uglier / more complex to understand
and I'm not sure if there's other edge cases I'm missing that we'd
need to account for. One other idea is to try avoiding the
iomap_read_end() call for non-error cases if we use your
bitmap_weight() idea above, then it wouldn't matter in that scenario
above if ctx->cur_folio points to a folio that already had read ended
on it. But I think that also just makes the code harder to
read/understand.
The original patch seemed cleanest to me, maybe if we renamed uptodate
to mark_uptodate, it'd be more palatable? eg
@@ -80,18 +80,19 @@ static void iomap_set_range_uptodate(struct folio
*folio, size_t off,
{
struct iomap_folio_state *ifs = folio->private;
unsigned long flags;
- bool uptodate = true;
+ bool mark_uptodate = true;
if (folio_test_uptodate(folio))
return;
if (ifs) {
spin_lock_irqsave(&ifs->state_lock, flags);
- uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
+ mark_uptodate = ifs_set_range_uptodate(folio, ifs, off, len) &&
+ !ifs->read_bytes_pending;
spin_unlock_irqrestore(&ifs->state_lock, flags);
}
- if (uptodate)
+ if (mark_uptodate)
folio_mark_uptodate(folio);
}
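The guard in the snippet above reduces to a single predicate; a hypothetical userspace model (names are illustrative, not the real iomap API):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the guard above: the folio flag may only be
 * set once every block is uptodate AND no read bytes remain pending,
 * leaving folio_end_read()'s XOR as the single place that sets the
 * flag while a read is in flight.
 */
static bool model_mark_uptodate(unsigned int uptodate_bytes,
				unsigned int folio_size,
				unsigned int read_bytes_pending)
{
	return uptodate_bytes == folio_size && read_bytes_pending == 0;
}
```

With the guard, the early iomap_set_range_uptodate() call in the EOF-zeroing case declines to mark the folio, and the later folio_end_read() XOR sets the flag exactly once.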
Thanks,
Joanne
>
> > I think the code would have to look something like [1] (this is
> > similar to the alternative approach I mentioned in my previous reply
> > but fixed up to cover some more edge cases).
> >
> > Thanks,
> > Joanne
> >
> > [1] https://github.com/joannekoong/linux/commit/b42f47726433a8130e8c27d1b43b16e27dfd6960
>
> I think we can do everything we need with a suitably modified
> iomap_finish_folio_read() rather than the new iomap_finish_read_range().
end of thread, other threads:[~2026-02-13 0:54 UTC | newest]
Thread overview: 50+ messages
2025-09-26 0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
2025-09-26 0:25 ` [PATCH v5 01/14] iomap: move bio read logic into helper function Joanne Koong
2025-09-26 0:25 ` [PATCH v5 02/14] iomap: move read/readahead bio submission " Joanne Koong
2025-09-26 0:25 ` [PATCH v5 03/14] iomap: store read/readahead bio generically Joanne Koong
2025-09-26 0:25 ` [PATCH v5 04/14] iomap: iterate over folio mapping in iomap_readpage_iter() Joanne Koong
2025-09-26 0:26 ` [PATCH v5 05/14] iomap: rename iomap_readpage_iter() to iomap_read_folio_iter() Joanne Koong
2025-09-26 0:26 ` [PATCH v5 06/14] iomap: rename iomap_readpage_ctx struct to iomap_read_folio_ctx Joanne Koong
2025-09-26 0:26 ` [PATCH v5 07/14] iomap: track pending read bytes more optimally Joanne Koong
2025-10-23 19:34 ` Brian Foster
2025-10-24 0:01 ` Joanne Koong
2025-10-24 16:25 ` Joanne Koong
2025-10-24 17:14 ` Brian Foster
2025-10-24 19:48 ` Joanne Koong
2025-10-24 21:55 ` Joanne Koong
2025-10-27 12:16 ` Brian Foster
2025-10-24 17:21 ` Matthew Wilcox
2025-10-24 19:22 ` Joanne Koong
2025-10-24 20:59 ` Matthew Wilcox
2025-10-24 21:37 ` Darrick J. Wong
2025-10-24 21:58 ` Joanne Koong
2025-09-26 0:26 ` [PATCH v5 08/14] iomap: set accurate iter->pos when reading folio ranges Joanne Koong
2025-09-26 0:26 ` [PATCH v5 09/14] iomap: add caller-provided callbacks for read and readahead Joanne Koong
2025-09-26 0:26 ` [PATCH v5 10/14] iomap: move buffered io bio logic into new file Joanne Koong
2025-09-26 0:26 ` [PATCH v5 11/14] iomap: make iomap_read_folio() a void return Joanne Koong
2025-09-26 0:26 ` [PATCH v5 12/14] fuse: use iomap for read_folio Joanne Koong
2025-12-23 22:30 ` [RFC PATCH 0/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read Sasha Levin
2025-12-23 22:30 ` [RFC PATCH 1/1] " Sasha Levin
2025-12-24 1:12 ` Joanne Koong
2025-12-24 1:31 ` Sasha Levin
2026-02-07 7:16 ` Wei Gao
2026-02-09 19:08 ` Joanne Koong
2026-02-10 0:12 ` Wei Gao
2026-02-10 0:20 ` Joanne Koong
2026-02-10 0:40 ` Wei Gao
2026-02-10 22:18 ` Joanne Koong
2026-02-11 0:00 ` Sasha Levin
2026-02-11 3:11 ` Matthew Wilcox
2026-02-11 19:33 ` Joanne Koong
2026-02-11 21:03 ` Matthew Wilcox
2026-02-11 23:13 ` Joanne Koong
2026-02-12 19:31 ` Matthew Wilcox
2026-02-13 0:53 ` Joanne Koong
2025-12-24 2:10 ` Matthew Wilcox
2025-12-24 15:43 ` Sasha Levin
2025-12-24 17:27 ` Matthew Wilcox
2025-12-24 21:21 ` Sasha Levin
2025-12-30 0:58 ` Joanne Koong
2025-09-26 0:26 ` [PATCH v5 13/14] fuse: use iomap for readahead Joanne Koong
2025-09-26 0:26 ` [PATCH v5 14/14] fuse: remove fc->blkbits workaround for partial writes Joanne Koong
2025-09-29 9:38 ` [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Christian Brauner