* [PATCH v5 01/14] iomap: move bio read logic into helper function
From: Joanne Koong @ 2025-09-26 0:25 UTC
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
Move the iomap_readpage_iter() bio read logic into a separate helper
function, iomap_bio_read_folio_range(). This is needed to make iomap
read/readahead more generically usable, especially for filesystems that
do not require CONFIG_BLOCK.
Additionally, rename buffered write's iomap_read_folio_range() to
iomap_bio_read_folio_range_sync() to better describe its synchronous
behavior.
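As a quick sketch of the resulting call structure (names taken from the
diff below):

	iomap_readpage_iter()				/* range setup + zeroing */
	  -> iomap_bio_read_folio_range()		/* new async bio read helper */

	__iomap_write_begin()
	  -> iomap_bio_read_folio_range_sync()		/* was iomap_read_folio_range() */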
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 68 ++++++++++++++++++++++++------------------
1 file changed, 39 insertions(+), 29 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9535733ed07a..7e65075b6345 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -367,36 +367,15 @@ struct iomap_readpage_ctx {
struct readahead_control *rac;
};
-static int iomap_readpage_iter(struct iomap_iter *iter,
- struct iomap_readpage_ctx *ctx)
+static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
+ struct iomap_readpage_ctx *ctx, loff_t pos, size_t plen)
{
+ struct folio *folio = ctx->cur_folio;
const struct iomap *iomap = &iter->iomap;
- loff_t pos = iter->pos;
+ struct iomap_folio_state *ifs = folio->private;
+ size_t poff = offset_in_folio(folio, pos);
loff_t length = iomap_length(iter);
- struct folio *folio = ctx->cur_folio;
- struct iomap_folio_state *ifs;
- size_t poff, plen;
sector_t sector;
- int ret;
-
- if (iomap->type == IOMAP_INLINE) {
- ret = iomap_read_inline_data(iter, folio);
- if (ret)
- return ret;
- return iomap_iter_advance(iter, length);
- }
-
- /* zero post-eof blocks as the page may be mapped */
- ifs = ifs_alloc(iter->inode, folio, iter->flags);
- iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
- if (plen == 0)
- goto done;
-
- if (iomap_block_needs_zeroing(iter, pos)) {
- folio_zero_range(folio, poff, plen);
- iomap_set_range_uptodate(folio, poff, plen);
- goto done;
- }
ctx->cur_folio_in_bio = true;
if (ifs) {
@@ -435,6 +414,37 @@ static int iomap_readpage_iter(struct iomap_iter *iter,
ctx->bio->bi_end_io = iomap_read_end_io;
bio_add_folio_nofail(ctx->bio, folio, plen, poff);
}
+}
+
+static int iomap_readpage_iter(struct iomap_iter *iter,
+ struct iomap_readpage_ctx *ctx)
+{
+ const struct iomap *iomap = &iter->iomap;
+ loff_t pos = iter->pos;
+ loff_t length = iomap_length(iter);
+ struct folio *folio = ctx->cur_folio;
+ size_t poff, plen;
+ int ret;
+
+ if (iomap->type == IOMAP_INLINE) {
+ ret = iomap_read_inline_data(iter, folio);
+ if (ret)
+ return ret;
+ return iomap_iter_advance(iter, length);
+ }
+
+ /* zero post-eof blocks as the page may be mapped */
+ ifs_alloc(iter->inode, folio, iter->flags);
+ iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
+ if (plen == 0)
+ goto done;
+
+ if (iomap_block_needs_zeroing(iter, pos)) {
+ folio_zero_range(folio, poff, plen);
+ iomap_set_range_uptodate(folio, poff, plen);
+ } else {
+ iomap_bio_read_folio_range(iter, ctx, pos, plen);
+ }
done:
	/*
	 * Move the caller beyond our range so that it keeps making progress.
@@ -559,7 +569,7 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
}
EXPORT_SYMBOL_GPL(iomap_readahead);
-static int iomap_read_folio_range(const struct iomap_iter *iter,
+static int iomap_bio_read_folio_range_sync(const struct iomap_iter *iter,
struct folio *folio, loff_t pos, size_t len)
{
const struct iomap *srcmap = iomap_iter_srcmap(iter);
@@ -572,7 +582,7 @@ static int iomap_read_folio_range(const struct iomap_iter *iter,
return submit_bio_wait(&bio);
}
#else
-static int iomap_read_folio_range(const struct iomap_iter *iter,
+static int iomap_bio_read_folio_range_sync(const struct iomap_iter *iter,
struct folio *folio, loff_t pos, size_t len)
{
WARN_ON_ONCE(1);
@@ -749,7 +759,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter,
status = write_ops->read_folio_range(iter,
folio, block_start, plen);
else
- status = iomap_read_folio_range(iter,
+ status = iomap_bio_read_folio_range_sync(iter,
folio, block_start, plen);
if (status)
return status;
--
2.47.3
* [PATCH v5 02/14] iomap: move read/readahead bio submission logic into helper function
From: Joanne Koong @ 2025-09-26 0:25 UTC
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
Move the read/readahead bio submission logic into a separate helper.
This is needed to make iomap read/readahead more generically usable,
especially for filesystems that do not require CONFIG_BLOCK.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 30 ++++++++++++++++--------------
1 file changed, 16 insertions(+), 14 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 7e65075b6345..f8b985bb5a6b 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -367,6 +367,14 @@ struct iomap_readpage_ctx {
struct readahead_control *rac;
};
+static void iomap_bio_submit_read(struct iomap_readpage_ctx *ctx)
+{
+ struct bio *bio = ctx->bio;
+
+ if (bio)
+ submit_bio(bio);
+}
+
static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
struct iomap_readpage_ctx *ctx, loff_t pos, size_t plen)
{
@@ -392,8 +400,7 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
gfp_t orig_gfp = gfp;
unsigned int nr_vecs = DIV_ROUND_UP(length, PAGE_SIZE);
- if (ctx->bio)
- submit_bio(ctx->bio);
+ iomap_bio_submit_read(ctx);
if (ctx->rac) /* same as readahead_gfp_mask */
gfp |= __GFP_NORETRY | __GFP_NOWARN;
@@ -488,13 +495,10 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
while ((ret = iomap_iter(&iter, ops)) > 0)
iter.status = iomap_read_folio_iter(&iter, &ctx);
- if (ctx.bio) {
- submit_bio(ctx.bio);
- WARN_ON_ONCE(!ctx.cur_folio_in_bio);
- } else {
- WARN_ON_ONCE(ctx.cur_folio_in_bio);
+ iomap_bio_submit_read(&ctx);
+
+ if (!ctx.cur_folio_in_bio)
folio_unlock(folio);
- }
/*
* Just like mpage_readahead and block_read_full_folio, we always
@@ -560,12 +564,10 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
while (iomap_iter(&iter, ops) > 0)
iter.status = iomap_readahead_iter(&iter, &ctx);
- if (ctx.bio)
- submit_bio(ctx.bio);
- if (ctx.cur_folio) {
- if (!ctx.cur_folio_in_bio)
- folio_unlock(ctx.cur_folio);
- }
+ iomap_bio_submit_read(&ctx);
+
+ if (ctx.cur_folio && !ctx.cur_folio_in_bio)
+ folio_unlock(ctx.cur_folio);
}
EXPORT_SYMBOL_GPL(iomap_readahead);
--
2.47.3
* [PATCH v5 03/14] iomap: store read/readahead bio generically
From: Joanne Koong @ 2025-09-26 0:25 UTC
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
Store the iomap_readpage_ctx bio generically as a "void *read_ctx".
This makes the read/readahead interface more generic, which allows it to
be used by filesystems that may not be block-based and may not have
CONFIG_BLOCK set.
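As a rough illustration of where this is headed (hypothetical myfs_*
names, not part of this patch), a filesystem that reads folio ranges
without bios could keep its own per-read state in the same field:

	/* Hypothetical sketch: batch state lives in ->read_ctx instead of a bio. */
	static void myfs_read_folio_range(const struct iomap_iter *iter,
			struct iomap_readpage_ctx *ctx, loff_t pos, size_t plen)
	{
		struct myfs_io_batch *batch = ctx->read_ctx;

		if (!batch) {
			batch = myfs_io_batch_alloc(iter->inode);
			ctx->read_ctx = batch;
		}
		/* queue [pos, pos + plen) of ctx->cur_folio for an async read */
		myfs_io_batch_add(batch, ctx->cur_folio, pos, plen);
	}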
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 29 ++++++++++++++---------------
1 file changed, 14 insertions(+), 15 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f8b985bb5a6b..b06b532033ad 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -363,13 +363,13 @@ static void iomap_read_end_io(struct bio *bio)
struct iomap_readpage_ctx {
struct folio *cur_folio;
bool cur_folio_in_bio;
- struct bio *bio;
+ void *read_ctx;
struct readahead_control *rac;
};
static void iomap_bio_submit_read(struct iomap_readpage_ctx *ctx)
{
- struct bio *bio = ctx->bio;
+ struct bio *bio = ctx->read_ctx;
if (bio)
submit_bio(bio);
@@ -384,6 +384,7 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
size_t poff = offset_in_folio(folio, pos);
loff_t length = iomap_length(iter);
sector_t sector;
+ struct bio *bio = ctx->read_ctx;
ctx->cur_folio_in_bio = true;
if (ifs) {
@@ -393,9 +394,8 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
}
sector = iomap_sector(iomap, pos);
- if (!ctx->bio ||
- bio_end_sector(ctx->bio) != sector ||
- !bio_add_folio(ctx->bio, folio, plen, poff)) {
+ if (!bio || bio_end_sector(bio) != sector ||
+ !bio_add_folio(bio, folio, plen, poff)) {
gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
gfp_t orig_gfp = gfp;
unsigned int nr_vecs = DIV_ROUND_UP(length, PAGE_SIZE);
@@ -404,22 +404,21 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
if (ctx->rac) /* same as readahead_gfp_mask */
gfp |= __GFP_NORETRY | __GFP_NOWARN;
- ctx->bio = bio_alloc(iomap->bdev, bio_max_segs(nr_vecs),
- REQ_OP_READ, gfp);
+ bio = bio_alloc(iomap->bdev, bio_max_segs(nr_vecs), REQ_OP_READ,
+ gfp);
/*
* If the bio_alloc fails, try it again for a single page to
* avoid having to deal with partial page reads. This emulates
* what do_mpage_read_folio does.
*/
- if (!ctx->bio) {
- ctx->bio = bio_alloc(iomap->bdev, 1, REQ_OP_READ,
- orig_gfp);
- }
+ if (!bio)
+ bio = bio_alloc(iomap->bdev, 1, REQ_OP_READ, orig_gfp);
if (ctx->rac)
- ctx->bio->bi_opf |= REQ_RAHEAD;
- ctx->bio->bi_iter.bi_sector = sector;
- ctx->bio->bi_end_io = iomap_read_end_io;
- bio_add_folio_nofail(ctx->bio, folio, plen, poff);
+ bio->bi_opf |= REQ_RAHEAD;
+ bio->bi_iter.bi_sector = sector;
+ bio->bi_end_io = iomap_read_end_io;
+ bio_add_folio_nofail(bio, folio, plen, poff);
+ ctx->read_ctx = bio;
}
}
--
2.47.3
* [PATCH v5 04/14] iomap: iterate over folio mapping in iomap_readpage_iter()
From: Joanne Koong @ 2025-09-26 0:25 UTC
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc, Christoph Hellwig
Iterate over all non-uptodate ranges of a folio mapping in a single call
to iomap_readpage_iter() instead of leaving the partial iteration to the
caller.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 53 ++++++++++++++++++++----------------------
1 file changed, 25 insertions(+), 28 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index b06b532033ad..dbe5783ee68c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -430,6 +430,7 @@ static int iomap_readpage_iter(struct iomap_iter *iter,
loff_t length = iomap_length(iter);
struct folio *folio = ctx->cur_folio;
size_t poff, plen;
+ loff_t count;
int ret;
if (iomap->type == IOMAP_INLINE) {
@@ -439,41 +440,35 @@ static int iomap_readpage_iter(struct iomap_iter *iter,
return iomap_iter_advance(iter, length);
}
- /* zero post-eof blocks as the page may be mapped */
ifs_alloc(iter->inode, folio, iter->flags);
- iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
- if (plen == 0)
- goto done;
- if (iomap_block_needs_zeroing(iter, pos)) {
- folio_zero_range(folio, poff, plen);
- iomap_set_range_uptodate(folio, poff, plen);
- } else {
- iomap_bio_read_folio_range(iter, ctx, pos, plen);
- }
+ length = min_t(loff_t, length,
+ folio_size(folio) - offset_in_folio(folio, pos));
+ while (length) {
+ iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff,
+ &plen);
-done:
- /*
- * Move the caller beyond our range so that it keeps making progress.
- * For that, we have to include any leading non-uptodate ranges, but
- * we can skip trailing ones as they will be handled in the next
- * iteration.
- */
- length = pos - iter->pos + plen;
- return iomap_iter_advance(iter, length);
-}
+ count = pos - iter->pos + plen;
+ if (WARN_ON_ONCE(count > length))
+ return -EIO;
-static int iomap_read_folio_iter(struct iomap_iter *iter,
- struct iomap_readpage_ctx *ctx)
-{
- int ret;
+ if (plen == 0)
+ return iomap_iter_advance(iter, count);
- while (iomap_length(iter)) {
- ret = iomap_readpage_iter(iter, ctx);
+ /* zero post-eof blocks as the page may be mapped */
+ if (iomap_block_needs_zeroing(iter, pos)) {
+ folio_zero_range(folio, poff, plen);
+ iomap_set_range_uptodate(folio, poff, plen);
+ } else {
+ iomap_bio_read_folio_range(iter, ctx, pos, plen);
+ }
+
+ ret = iomap_iter_advance(iter, count);
if (ret)
return ret;
+ length -= count;
+ pos = iter->pos;
}
-
return 0;
}
@@ -492,7 +487,7 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
trace_iomap_readpage(iter.inode, 1);
while ((ret = iomap_iter(&iter, ops)) > 0)
- iter.status = iomap_read_folio_iter(&iter, &ctx);
+ iter.status = iomap_readpage_iter(&iter, &ctx);
iomap_bio_submit_read(&ctx);
@@ -522,6 +517,8 @@ static int iomap_readahead_iter(struct iomap_iter *iter,
}
if (!ctx->cur_folio) {
ctx->cur_folio = readahead_folio(ctx->rac);
+ if (WARN_ON_ONCE(!ctx->cur_folio))
+ return -EINVAL;
ctx->cur_folio_in_bio = false;
}
ret = iomap_readpage_iter(iter, ctx);
--
2.47.3
* [PATCH v5 05/14] iomap: rename iomap_readpage_iter() to iomap_read_folio_iter()
From: Joanne Koong @ 2025-09-26 0:26 UTC
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc, Christoph Hellwig
The ->readpage address_space operation was deprecated; reads now operate
on folios.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/iomap/buffered-io.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index dbe5783ee68c..23601373573e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -422,7 +422,7 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
}
}
-static int iomap_readpage_iter(struct iomap_iter *iter,
+static int iomap_read_folio_iter(struct iomap_iter *iter,
struct iomap_readpage_ctx *ctx)
{
const struct iomap *iomap = &iter->iomap;
@@ -487,7 +487,7 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
trace_iomap_readpage(iter.inode, 1);
while ((ret = iomap_iter(&iter, ops)) > 0)
- iter.status = iomap_readpage_iter(&iter, &ctx);
+ iter.status = iomap_read_folio_iter(&iter, &ctx);
iomap_bio_submit_read(&ctx);
@@ -521,7 +521,7 @@ static int iomap_readahead_iter(struct iomap_iter *iter,
return -EINVAL;
ctx->cur_folio_in_bio = false;
}
- ret = iomap_readpage_iter(iter, ctx);
+ ret = iomap_read_folio_iter(iter, ctx);
if (ret)
return ret;
}
--
2.47.3
* [PATCH v5 06/14] iomap: rename iomap_readpage_ctx struct to iomap_read_folio_ctx
From: Joanne Koong @ 2025-09-26 0:26 UTC
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc, Christoph Hellwig
The ->readpage address_space operation was deprecated; reads now operate
on folios.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/iomap/buffered-io.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 23601373573e..09e65771a947 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -360,14 +360,14 @@ static void iomap_read_end_io(struct bio *bio)
bio_put(bio);
}
-struct iomap_readpage_ctx {
+struct iomap_read_folio_ctx {
struct folio *cur_folio;
bool cur_folio_in_bio;
void *read_ctx;
struct readahead_control *rac;
};
-static void iomap_bio_submit_read(struct iomap_readpage_ctx *ctx)
+static void iomap_bio_submit_read(struct iomap_read_folio_ctx *ctx)
{
struct bio *bio = ctx->read_ctx;
@@ -376,7 +376,7 @@ static void iomap_bio_submit_read(struct iomap_readpage_ctx *ctx)
}
static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
- struct iomap_readpage_ctx *ctx, loff_t pos, size_t plen)
+ struct iomap_read_folio_ctx *ctx, loff_t pos, size_t plen)
{
struct folio *folio = ctx->cur_folio;
const struct iomap *iomap = &iter->iomap;
@@ -423,7 +423,7 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
}
static int iomap_read_folio_iter(struct iomap_iter *iter,
- struct iomap_readpage_ctx *ctx)
+ struct iomap_read_folio_ctx *ctx)
{
const struct iomap *iomap = &iter->iomap;
loff_t pos = iter->pos;
@@ -479,7 +479,7 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
.pos = folio_pos(folio),
.len = folio_size(folio),
};
- struct iomap_readpage_ctx ctx = {
+ struct iomap_read_folio_ctx ctx = {
.cur_folio = folio,
};
int ret;
@@ -504,7 +504,7 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
EXPORT_SYMBOL_GPL(iomap_read_folio);
static int iomap_readahead_iter(struct iomap_iter *iter,
- struct iomap_readpage_ctx *ctx)
+ struct iomap_read_folio_ctx *ctx)
{
int ret;
@@ -551,7 +551,7 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
.pos = readahead_pos(rac),
.len = readahead_length(rac),
};
- struct iomap_readpage_ctx ctx = {
+ struct iomap_read_folio_ctx ctx = {
.rac = rac,
};
--
2.47.3
* [PATCH v5 07/14] iomap: track pending read bytes more optimally
From: Joanne Koong @ 2025-09-26 0:26 UTC
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
Instead of incrementing read_bytes_pending for every folio range read in
(which requires acquiring the spinlock to do so), set read_bytes_pending
to the folio size when the first range is asynchronously read in, keep
track of how many bytes total are asynchronously read in, and adjust
read_bytes_pending accordingly after issuing requests to read in all the
necessary ranges.
iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
value for pending bytes necessarily indicates the folio is in the bio.
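As a worked example (sizes for illustration only): for a 16k folio where
4k is already uptodate, the first range needing I/O charges the full 16k
to read_bytes_pending up front. The caller then tracks the 12k it
actually submits, and afterwards subtracts the unsubmitted 4k
(folio_size minus submitted bytes) in one go, so the read completions
account for the remaining 12k and the last one ends the folio read. The
up-front folio-size charge also keeps a completion that races with
submission from seeing read_bytes_pending hit zero early.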
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 87 ++++++++++++++++++++++++++++++++----------
1 file changed, 66 insertions(+), 21 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 09e65771a947..4e6258fdb915 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -362,7 +362,6 @@ static void iomap_read_end_io(struct bio *bio)
struct iomap_read_folio_ctx {
struct folio *cur_folio;
- bool cur_folio_in_bio;
void *read_ctx;
struct readahead_control *rac;
};
@@ -380,19 +379,11 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
{
struct folio *folio = ctx->cur_folio;
const struct iomap *iomap = &iter->iomap;
- struct iomap_folio_state *ifs = folio->private;
size_t poff = offset_in_folio(folio, pos);
loff_t length = iomap_length(iter);
sector_t sector;
struct bio *bio = ctx->read_ctx;
- ctx->cur_folio_in_bio = true;
- if (ifs) {
- spin_lock_irq(&ifs->state_lock);
- ifs->read_bytes_pending += plen;
- spin_unlock_irq(&ifs->state_lock);
- }
-
sector = iomap_sector(iomap, pos);
if (!bio || bio_end_sector(bio) != sector ||
!bio_add_folio(bio, folio, plen, poff)) {
@@ -422,8 +413,57 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
}
}
+static void iomap_read_init(struct folio *folio)
+{
+ struct iomap_folio_state *ifs = folio->private;
+
+ if (ifs) {
+ size_t len = folio_size(folio);
+
+ spin_lock_irq(&ifs->state_lock);
+ ifs->read_bytes_pending += len;
+ spin_unlock_irq(&ifs->state_lock);
+ }
+}
+
+static void iomap_read_end(struct folio *folio, size_t bytes_pending)
+{
+ struct iomap_folio_state *ifs;
+
+ /*
+ * If there are no bytes pending, this means we are responsible for
+ * unlocking the folio here, since no IO helper has taken ownership of
+ * it.
+ */
+ if (!bytes_pending) {
+ folio_unlock(folio);
+ return;
+ }
+
+ ifs = folio->private;
+ if (ifs) {
+ bool end_read, uptodate;
+ size_t bytes_accounted = folio_size(folio) - bytes_pending;
+
+ spin_lock_irq(&ifs->state_lock);
+ ifs->read_bytes_pending -= bytes_accounted;
+ /*
+ * If !ifs->read_bytes_pending, this means all pending reads
+ * by the IO helper have already completed, which means we need
+ * to end the folio read here. If ifs->read_bytes_pending != 0,
+ * the IO helper will end the folio read.
+ */
+ end_read = !ifs->read_bytes_pending;
+ if (end_read)
+ uptodate = ifs_is_fully_uptodate(folio, ifs);
+ spin_unlock_irq(&ifs->state_lock);
+ if (end_read)
+ folio_end_read(folio, uptodate);
+ }
+}
+
static int iomap_read_folio_iter(struct iomap_iter *iter,
- struct iomap_read_folio_ctx *ctx)
+ struct iomap_read_folio_ctx *ctx, size_t *bytes_pending)
{
const struct iomap *iomap = &iter->iomap;
loff_t pos = iter->pos;
@@ -460,6 +500,9 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
folio_zero_range(folio, poff, plen);
iomap_set_range_uptodate(folio, poff, plen);
} else {
+ if (!*bytes_pending)
+ iomap_read_init(folio);
+ *bytes_pending += plen;
iomap_bio_read_folio_range(iter, ctx, pos, plen);
}
@@ -482,17 +525,18 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
struct iomap_read_folio_ctx ctx = {
.cur_folio = folio,
};
+ size_t bytes_pending = 0;
int ret;
trace_iomap_readpage(iter.inode, 1);
while ((ret = iomap_iter(&iter, ops)) > 0)
- iter.status = iomap_read_folio_iter(&iter, &ctx);
+ iter.status = iomap_read_folio_iter(&iter, &ctx,
+ &bytes_pending);
iomap_bio_submit_read(&ctx);
- if (!ctx.cur_folio_in_bio)
- folio_unlock(folio);
+ iomap_read_end(folio, bytes_pending);
/*
* Just like mpage_readahead and block_read_full_folio, we always
@@ -504,24 +548,23 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
EXPORT_SYMBOL_GPL(iomap_read_folio);
static int iomap_readahead_iter(struct iomap_iter *iter,
- struct iomap_read_folio_ctx *ctx)
+ struct iomap_read_folio_ctx *ctx, size_t *cur_bytes_pending)
{
int ret;
while (iomap_length(iter)) {
if (ctx->cur_folio &&
offset_in_folio(ctx->cur_folio, iter->pos) == 0) {
- if (!ctx->cur_folio_in_bio)
- folio_unlock(ctx->cur_folio);
+ iomap_read_end(ctx->cur_folio, *cur_bytes_pending);
ctx->cur_folio = NULL;
}
if (!ctx->cur_folio) {
ctx->cur_folio = readahead_folio(ctx->rac);
if (WARN_ON_ONCE(!ctx->cur_folio))
return -EINVAL;
- ctx->cur_folio_in_bio = false;
+ *cur_bytes_pending = 0;
}
- ret = iomap_read_folio_iter(iter, ctx);
+ ret = iomap_read_folio_iter(iter, ctx, cur_bytes_pending);
if (ret)
return ret;
}
@@ -554,16 +597,18 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
struct iomap_read_folio_ctx ctx = {
.rac = rac,
};
+ size_t cur_bytes_pending;
trace_iomap_readahead(rac->mapping->host, readahead_count(rac));
while (iomap_iter(&iter, ops) > 0)
- iter.status = iomap_readahead_iter(&iter, &ctx);
+ iter.status = iomap_readahead_iter(&iter, &ctx,
+ &cur_bytes_pending);
iomap_bio_submit_read(&ctx);
- if (ctx.cur_folio && !ctx.cur_folio_in_bio)
- folio_unlock(ctx.cur_folio);
+ if (ctx.cur_folio)
+ iomap_read_end(ctx.cur_folio, cur_bytes_pending);
}
EXPORT_SYMBOL_GPL(iomap_readahead);
--
2.47.3
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
From: Brian Foster @ 2025-10-23 19:34 UTC
To: Joanne Koong
Cc: brauner, miklos, djwong, hch, hsiangkao, linux-block, gfs2,
linux-fsdevel, kernel-team, linux-xfs, linux-doc
On Thu, Sep 25, 2025 at 05:26:02PM -0700, Joanne Koong wrote:
> Instead of incrementing read_bytes_pending for every folio range read in
> (which requires acquiring the spinlock to do so), set read_bytes_pending
> to the folio size when the first range is asynchronously read in, keep
> track of how many bytes total are asynchronously read in, and adjust
> read_bytes_pending accordingly after issuing requests to read in all the
> necessary ranges.
>
> iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
> value for pending bytes necessarily indicates the folio is in the bio.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
Hi Joanne,
I was throwing some extra testing at the vfs-6.19.iomap branch since the
little merge conflict thing with iomap_iter_advance(). I end up hitting
what appears to be a lockup on XFS with 1k FSB (-bsize=1k) running
generic/051. It reproduces fairly reliably within a few iterations or so
and seems to always stall during a read for a dedupe operation:
task:fsstress state:D stack:0 pid:12094 tgid:12094 ppid:12091 task_flags:0x400140 flags:0x00080003
Call Trace:
<TASK>
__schedule+0x2fc/0x7a0
schedule+0x27/0x80
io_schedule+0x46/0x70
folio_wait_bit_common+0x12b/0x310
? __pfx_wake_page_function+0x10/0x10
? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
filemap_read_folio+0x85/0xd0
? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
do_read_cache_folio+0x7c/0x1b0
vfs_dedupe_file_range_compare.constprop.0+0xaf/0x2d0
__generic_remap_file_range_prep+0x276/0x2a0
generic_remap_file_range_prep+0x10/0x20
xfs_reflink_remap_prep+0x22c/0x300 [xfs]
xfs_file_remap_range+0x84/0x360 [xfs]
vfs_dedupe_file_range_one+0x1b2/0x1d0
? remap_verify_area+0x46/0x140
vfs_dedupe_file_range+0x162/0x220
do_vfs_ioctl+0x4d1/0x940
__x64_sys_ioctl+0x75/0xe0
do_syscall_64+0x84/0x800
? do_syscall_64+0xbb/0x800
? avc_has_perm_noaudit+0x6b/0xf0
? _copy_to_user+0x31/0x40
? cp_new_stat+0x130/0x170
? __do_sys_newfstat+0x44/0x70
? do_syscall_64+0xbb/0x800
? do_syscall_64+0xbb/0x800
? clear_bhb_loop+0x30/0x80
? clear_bhb_loop+0x30/0x80
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fe6bbd9a14d
RSP: 002b:00007ffde72cd4e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000068 RCX: 00007fe6bbd9a14d
RDX: 000000000a1394b0 RSI: 00000000c0189436 RDI: 0000000000000004
RBP: 00007ffde72cd530 R08: 0000000000001000 R09: 000000000a11a3fc
R10: 000000000001d6c0 R11: 0000000000000246 R12: 000000000a12cfb0
R13: 000000000a12ba10 R14: 000000000a14e610 R15: 0000000000019000
</TASK>
It wasn't immediately clear to me what the issue was so I bisected and
it landed on this patch. It kind of looks like we're failing to unlock a
folio at some point and then tripping over it later..? I can kill the
fsstress process but then the umount ultimately gets stuck tossing
pagecache [1], so the mount still ends up stuck indefinitely. Anyways,
I'll poke at it some more but I figure you might be able to make sense
of this faster than I can.
Brian
[1] umount stack trace:
task:umount state:D stack:0 pid:12216 tgid:12216 ppid:2514 task_flags:0x400100 flags:0x00080001
Call Trace:
<TASK>
__schedule+0x2fc/0x7a0
schedule+0x27/0x80
io_schedule+0x46/0x70
folio_wait_bit_common+0x12b/0x310
? __pfx_wake_page_function+0x10/0x10
truncate_inode_pages_range+0x42a/0x4d0
xfs_fs_evict_inode+0x1f/0x30 [xfs]
evict+0x112/0x290
evict_inodes+0x209/0x230
generic_shutdown_super+0x42/0x100
kill_block_super+0x1a/0x40
xfs_kill_sb+0x12/0x20 [xfs]
deactivate_locked_super+0x33/0xb0
cleanup_mnt+0xba/0x150
task_work_run+0x5c/0x90
exit_to_user_mode_loop+0x12f/0x170
do_syscall_64+0x1af/0x800
? vfs_statx+0x80/0x160
? do_statx+0x62/0xa0
? __x64_sys_statx+0xaf/0x100
? do_syscall_64+0xbb/0x800
? __x64_sys_statx+0xaf/0x100
? do_syscall_64+0xbb/0x800
? count_memcg_events+0xdd/0x1b0
? handle_mm_fault+0x220/0x340
? do_user_addr_fault+0x2c3/0x7f0
? clear_bhb_loop+0x30/0x80
? clear_bhb_loop+0x30/0x80
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fdd641ed5ab
RSP: 002b:00007ffd671182e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 0000559b3e2056b0 RCX: 00007fdd641ed5ab
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000559b3e205ac0
RBP: 00007ffd671183c0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000103 R11: 0000000000000246 R12: 0000559b3e2057b8
R13: 0000000000000000 R14: 0000559b3e205ac0 R15: 0000000000000000
</TASK>
> fs/iomap/buffered-io.c | 87 ++++++++++++++++++++++++++++++++----------
> 1 file changed, 66 insertions(+), 21 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 09e65771a947..4e6258fdb915 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -362,7 +362,6 @@ static void iomap_read_end_io(struct bio *bio)
>
> struct iomap_read_folio_ctx {
> struct folio *cur_folio;
> - bool cur_folio_in_bio;
> void *read_ctx;
> struct readahead_control *rac;
> };
> @@ -380,19 +379,11 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
> {
> struct folio *folio = ctx->cur_folio;
> const struct iomap *iomap = &iter->iomap;
> - struct iomap_folio_state *ifs = folio->private;
> size_t poff = offset_in_folio(folio, pos);
> loff_t length = iomap_length(iter);
> sector_t sector;
> struct bio *bio = ctx->read_ctx;
>
> - ctx->cur_folio_in_bio = true;
> - if (ifs) {
> - spin_lock_irq(&ifs->state_lock);
> - ifs->read_bytes_pending += plen;
> - spin_unlock_irq(&ifs->state_lock);
> - }
> -
> sector = iomap_sector(iomap, pos);
> if (!bio || bio_end_sector(bio) != sector ||
> !bio_add_folio(bio, folio, plen, poff)) {
> @@ -422,8 +413,57 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
> }
> }
>
> +static void iomap_read_init(struct folio *folio)
> +{
> + struct iomap_folio_state *ifs = folio->private;
> +
> + if (ifs) {
> + size_t len = folio_size(folio);
> +
> + spin_lock_irq(&ifs->state_lock);
> + ifs->read_bytes_pending += len;
> + spin_unlock_irq(&ifs->state_lock);
> + }
> +}
> +
> +static void iomap_read_end(struct folio *folio, size_t bytes_pending)
> +{
> + struct iomap_folio_state *ifs;
> +
> + /*
> + * If there are no bytes pending, this means we are responsible for
> + * unlocking the folio here, since no IO helper has taken ownership of
> + * it.
> + */
> + if (!bytes_pending) {
> + folio_unlock(folio);
> + return;
> + }
> +
> + ifs = folio->private;
> + if (ifs) {
> + bool end_read, uptodate;
> + size_t bytes_accounted = folio_size(folio) - bytes_pending;
> +
> + spin_lock_irq(&ifs->state_lock);
> + ifs->read_bytes_pending -= bytes_accounted;
> + /*
> + * If !ifs->read_bytes_pending, this means all pending reads
> + * by the IO helper have already completed, which means we need
> + * to end the folio read here. If ifs->read_bytes_pending != 0,
> + * the IO helper will end the folio read.
> + */
> + end_read = !ifs->read_bytes_pending;
> + if (end_read)
> + uptodate = ifs_is_fully_uptodate(folio, ifs);
> + spin_unlock_irq(&ifs->state_lock);
> + if (end_read)
> + folio_end_read(folio, uptodate);
> + }
> +}
> +
> static int iomap_read_folio_iter(struct iomap_iter *iter,
> - struct iomap_read_folio_ctx *ctx)
> + struct iomap_read_folio_ctx *ctx, size_t *bytes_pending)
> {
> const struct iomap *iomap = &iter->iomap;
> loff_t pos = iter->pos;
> @@ -460,6 +500,9 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
> folio_zero_range(folio, poff, plen);
> iomap_set_range_uptodate(folio, poff, plen);
> } else {
> + if (!*bytes_pending)
> + iomap_read_init(folio);
> + *bytes_pending += plen;
> iomap_bio_read_folio_range(iter, ctx, pos, plen);
> }
>
> @@ -482,17 +525,18 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
> struct iomap_read_folio_ctx ctx = {
> .cur_folio = folio,
> };
> + size_t bytes_pending = 0;
> int ret;
>
> trace_iomap_readpage(iter.inode, 1);
>
> while ((ret = iomap_iter(&iter, ops)) > 0)
> - iter.status = iomap_read_folio_iter(&iter, &ctx);
> + iter.status = iomap_read_folio_iter(&iter, &ctx,
> + &bytes_pending);
>
> iomap_bio_submit_read(&ctx);
>
> - if (!ctx.cur_folio_in_bio)
> - folio_unlock(folio);
> + iomap_read_end(folio, bytes_pending);
>
> /*
> * Just like mpage_readahead and block_read_full_folio, we always
> @@ -504,24 +548,23 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
> EXPORT_SYMBOL_GPL(iomap_read_folio);
>
> static int iomap_readahead_iter(struct iomap_iter *iter,
> - struct iomap_read_folio_ctx *ctx)
> + struct iomap_read_folio_ctx *ctx, size_t *cur_bytes_pending)
> {
> int ret;
>
> while (iomap_length(iter)) {
> if (ctx->cur_folio &&
> offset_in_folio(ctx->cur_folio, iter->pos) == 0) {
> - if (!ctx->cur_folio_in_bio)
> - folio_unlock(ctx->cur_folio);
> + iomap_read_end(ctx->cur_folio, *cur_bytes_pending);
> ctx->cur_folio = NULL;
> }
> if (!ctx->cur_folio) {
> ctx->cur_folio = readahead_folio(ctx->rac);
> if (WARN_ON_ONCE(!ctx->cur_folio))
> return -EINVAL;
> - ctx->cur_folio_in_bio = false;
> + *cur_bytes_pending = 0;
> }
> - ret = iomap_read_folio_iter(iter, ctx);
> + ret = iomap_read_folio_iter(iter, ctx, cur_bytes_pending);
> if (ret)
> return ret;
> }
> @@ -554,16 +597,18 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
> struct iomap_read_folio_ctx ctx = {
> .rac = rac,
> };
> + size_t cur_bytes_pending;
>
> trace_iomap_readahead(rac->mapping->host, readahead_count(rac));
>
> while (iomap_iter(&iter, ops) > 0)
> - iter.status = iomap_readahead_iter(&iter, &ctx);
> + iter.status = iomap_readahead_iter(&iter, &ctx,
> + &cur_bytes_pending);
>
> iomap_bio_submit_read(&ctx);
>
> - if (ctx.cur_folio && !ctx.cur_folio_in_bio)
> - folio_unlock(ctx.cur_folio);
> + if (ctx.cur_folio)
> + iomap_read_end(ctx.cur_folio, cur_bytes_pending);
> }
> EXPORT_SYMBOL_GPL(iomap_readahead);
>
> --
> 2.47.3
>
>
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
From: Joanne Koong @ 2025-10-24 0:01 UTC
To: Brian Foster
Cc: brauner, miklos, djwong, hch, hsiangkao, linux-block, gfs2,
linux-fsdevel, kernel-team, linux-xfs, linux-doc
On Thu, Oct 23, 2025 at 12:30 PM Brian Foster <bfoster@redhat.com> wrote:
>
> On Thu, Sep 25, 2025 at 05:26:02PM -0700, Joanne Koong wrote:
> > Instead of incrementing read_bytes_pending for every folio range read in
> > (which requires acquiring the spinlock to do so), set read_bytes_pending
> > to the folio size when the first range is asynchronously read in, keep
> > track of how many bytes total are asynchronously read in, and adjust
> > read_bytes_pending accordingly after issuing requests to read in all the
> > necessary ranges.
> >
> > iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
> > value for pending bytes necessarily indicates the folio is in the bio.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
>
> Hi Joanne,
>
> I was throwing some extra testing at the vfs-6.19.iomap branch since the
> little merge conflict thing with iomap_iter_advance(). I end up hitting
> what appears to be a lockup on XFS with 1k FSB (-bsize=1k) running
> generic/051. It reproduces fairly reliably within a few iterations or so
> and seems to always stall during a read for a dedupe operation:
>
> [... fsstress stack trace snipped, same as above ...]
>
> It wasn't immediately clear to me what the issue was so I bisected and
> it landed on this patch. It kind of looks like we're failing to unlock a
> folio at some point and then tripping over it later..? I can kill the
> fsstress process but then the umount ultimately gets stuck tossing
> pagecache [1], so the mount still ends up stuck indefinitely. Anyways,
> I'll poke at it some more but I figure you might be able to make sense
> of this faster than I can.
>
> Brian
Hi Brian,
Thanks for your report and the repro instructions. I will look into
this and report back what I find.
Thanks,
Joanne
>
> [1] umount stack trace:
>
> [... umount stack trace snipped, same as above ...]
>
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
From: Joanne Koong @ 2025-10-24 16:25 UTC
To: Brian Foster
Cc: brauner, miklos, djwong, hch, hsiangkao, linux-block, gfs2,
linux-fsdevel, kernel-team, linux-xfs, linux-doc
On Thu, Oct 23, 2025 at 5:01 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Thu, Oct 23, 2025 at 12:30 PM Brian Foster <bfoster@redhat.com> wrote:
> >
> > On Thu, Sep 25, 2025 at 05:26:02PM -0700, Joanne Koong wrote:
> > > Instead of incrementing read_bytes_pending for every folio range read in
> > > (which requires acquiring the spinlock to do so), set read_bytes_pending
> > > to the folio size when the first range is asynchronously read in, keep
> > > track of how many bytes total are asynchronously read in, and adjust
> > > read_bytes_pending accordingly after issuing requests to read in all the
> > > necessary ranges.
> > >
> > > iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
> > > value for pending bytes necessarily indicates the folio is in the bio.
> > >
> > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
> > > ---
> >
> > Hi Joanne,
> >
> > I was throwing some extra testing at the vfs-6.19.iomap branch since the
> > little merge conflict thing with iomap_iter_advance(). I end up hitting
> > what appears to be a lockup on XFS with 1k FSB (-bsize=1k) running
> > generic/051. It reproduces fairly reliably within a few iterations or so
> > and seems to always stall during a read for a dedupe operation:
> >
> > [... stack trace snipped ...]
> >
> > It wasn't immediately clear to me what the issue was so I bisected and
> > it landed on this patch. It kind of looks like we're failing to unlock a
> > folio at some point and then tripping over it later..? I can kill the
> > fsstress process but then the umount ultimately gets stuck tossing
> > pagecache [1], so the mount still ends up stuck indefinitely. Anyways,
> > I'll poke at it some more but I figure you might be able to make sense
> > of this faster than I can.
> >
> > Brian
>
> Hi Brian,
>
> Thanks for your report and the repro instructions. I will look into
> this and report back what I find.
This is the fix:
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4e6258fdb915..aa46fec8362d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -445,6 +445,9 @@ static void iomap_read_end(struct folio *folio, size_t bytes_pending)
bool end_read, uptodate;
size_t bytes_accounted = folio_size(folio) - bytes_pending;
+ if (!bytes_accounted)
+ return;
+
spin_lock_irq(&ifs->state_lock);
What I missed was that if all the bytes in the folio are non-uptodate
and need to be read in by the filesystem, then there's a bug where the
read will be ended on the folio twice (in iomap_read_end() and when the
filesystem calls iomap_finish_folio_read(), when only the filesystem
should end the read), which does two folio unlocks and ends up
re-locking the folio. Looking at the writeback patch that does a
similar optimization [1], I missed the same thing there.
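Concretely (sizes for illustration): if all of an 8k folio needs I/O,
bytes_pending == folio_size, so bytes_accounted comes out to 0 and
there is nothing for iomap_read_end() to drop. If the read has already
completed by the time iomap_read_end() runs, ifs->read_bytes_pending is
already 0, end_read evaluates to true, and folio_end_read() gets called
a second time; since folio_end_read() flips PG_locked, the second call
leaves the folio locked. The early return when !bytes_accounted skips
that second completion path.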
I'll fix up both. Thanks for catching this and bisecting it down to
this patch. Sorry for the trouble.
Thanks,
Joanne
[1] https://lore.kernel.org/linux-fsdevel/20251009225611.3744728-4-joannelkoong@gmail.com/
>
> Thanks,
> Joanne
> >
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
From: Brian Foster @ 2025-10-24 17:14 UTC
From: Brian Foster @ 2025-10-24 17:14 UTC (permalink / raw)
To: Joanne Koong
Cc: brauner, miklos, djwong, hch, hsiangkao, linux-block, gfs2,
linux-fsdevel, kernel-team, linux-xfs, linux-doc
On Fri, Oct 24, 2025 at 09:25:13AM -0700, Joanne Koong wrote:
> On Thu, Oct 23, 2025 at 5:01 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Thu, Oct 23, 2025 at 12:30 PM Brian Foster <bfoster@redhat.com> wrote:
> > >
> > > On Thu, Sep 25, 2025 at 05:26:02PM -0700, Joanne Koong wrote:
> > > > Instead of incrementing read_bytes_pending for every folio range read in
> > > > (which requires acquiring the spinlock to do so), set read_bytes_pending
> > > > to the folio size when the first range is asynchronously read in, keep
> > > > track of how many bytes total are asynchronously read in, and adjust
> > > > read_bytes_pending accordingly after issuing requests to read in all the
> > > > necessary ranges.
> > > >
> > > > iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
> > > > value for pending bytes necessarily indicates the folio is in the bio.
> > > >
> > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
> > > > ---
> > >
> > > Hi Joanne,
> > >
> > > I was throwing some extra testing at the vfs-6.19.iomap branch since the
> > > little merge conflict thing with iomap_iter_advance(). I end up hitting
> > > what appears to be a lockup on XFS with 1k FSB (-bsize=1k) running
> > > generic/051. It reproduces fairly reliably within a few iterations or so
> > > and seems to always stall during a read for a dedupe operation:
> > >
> > > [... stack trace snipped ...]
> > >
> > > It wasn't immediately clear to me what the issue was so I bisected and
> > > it landed on this patch. It kind of looks like we're failing to unlock a
> > > folio at some point and then tripping over it later..? I can kill the
> > > fsstress process but then the umount ultimately gets stuck tossing
> > > pagecache [1], so the mount still ends up stuck indefinitely. Anyways,
> > > I'll poke at it some more but I figure you might be able to make sense
> > > of this faster than I can.
> > >
> > > Brian
> >
> > Hi Brian,
> >
> > Thanks for your report and the repro instructions. I will look into
> > this and report back what I find.
>
> This is the fix:
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 4e6258fdb915..aa46fec8362d 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -445,6 +445,9 @@ static void iomap_read_end(struct folio *folio, size_t bytes_pending)
> bool end_read, uptodate;
> size_t bytes_accounted = folio_size(folio) - bytes_pending;
>
> + if (!bytes_accounted)
> + return;
> +
> spin_lock_irq(&ifs->state_lock);
>
>
> What I missed was that if all the bytes in the folio are non-uptodate
> and need to be read in by the filesystem, then there's a bug where the
> read will be ended on the folio twice (in iomap_read_end() and when the
> filesystem calls iomap_finish_folio_read(), when only the filesystem
> should end the read), which does two folio unlocks and ends up
> re-locking the folio. Looking at the writeback patch that does a
> similar optimization [1], I missed the same thing there.
>
Makes sense... though a short comment wouldn't hurt in there. ;) I found
myself a little confused by the accounted vs. pending naming when
reading through that code. If I follow correctly, the intent is that
"accounted" refers to the additional bytes charged to read_bytes_pending
via the init (where it just accounts the whole folio up front), and
"pending" refers to submitted I/O.

Presumably that extra accounting doubly serves as the typical "don't
complete the op before the submitter is done processing" extra
reference, except in this full-submission case, of course. If so, that's
subtle enough in my mind that a sentence or two on it wouldn't hurt.
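(I mean the usual submit-side bias pattern; sketch only, with a made-up
op and helpers, not the iomap code itself:)

	/*
	 * The submitter holds one extra count so a completion that fires
	 * while ranges are still being submitted can't drop the count to
	 * zero early. for_each_range()/submit_range()/complete_op() are
	 * hypothetical.
	 */
	atomic_set(&op->pending, 1);			/* submitter's bias */
	for_each_range(op, range) {
		atomic_inc(&op->pending);		/* taken before each submit */
		submit_range(op, range);		/* completion does dec_and_test */
	}
	if (atomic_dec_and_test(&op->pending))		/* drop the bias */
		complete_op(op);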
> I'll fix up both. Thanks for catching this and bisecting it down to
> this patch. Sorry for the trouble.
>
No prob. Thanks for the fix!
Brian
> Thanks,
> Joanne
>
> [1] https://lore.kernel.org/linux-fsdevel/20251009225611.3744728-4-joannelkoong@gmail.com/
> >
> > Thanks,
> > Joanne
> > >
>
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
From: Joanne Koong @ 2025-10-24 19:48 UTC
To: Brian Foster
Cc: brauner, miklos, djwong, hch, hsiangkao, linux-block, gfs2,
linux-fsdevel, kernel-team, linux-xfs, linux-doc
On Fri, Oct 24, 2025 at 10:10 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Fri, Oct 24, 2025 at 09:25:13AM -0700, Joanne Koong wrote:
> > On Thu, Oct 23, 2025 at 5:01 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Thu, Oct 23, 2025 at 12:30 PM Brian Foster <bfoster@redhat.com> wrote:
> > > >
> > > > On Thu, Sep 25, 2025 at 05:26:02PM -0700, Joanne Koong wrote:
> > > > > Instead of incrementing read_bytes_pending for every folio range read in
> > > > > (which requires acquiring the spinlock to do so), set read_bytes_pending
> > > > > to the folio size when the first range is asynchronously read in, keep
> > > > > track of how many bytes total are asynchronously read in, and adjust
> > > > > read_bytes_pending accordingly after issuing requests to read in all the
> > > > > necessary ranges.
> > > > >
> > > > > iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
> > > > > value for pending bytes necessarily indicates the folio is in the bio.
> > > > >
> > > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > > Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
> > > > > ---
> > > >
> > > > Hi Joanne,
> > > >
> > > > I was throwing some extra testing at the vfs-6.19.iomap branch since the
> > > > little merge conflict thing with iomap_iter_advance(). I end up hitting
> > > > what appears to be a lockup on XFS with 1k FSB (-bsize=1k) running
> > > > generic/051. It reproduces fairly reliably within a few iterations or so
> > > > and seems to always stall during a read for a dedupe operation:
> > > >
> > > > [... stack trace snipped ...]
> > > >
> > > > It wasn't immediately clear to me what the issue was so I bisected and
> > > > it landed on this patch. It kind of looks like we're failing to unlock a
> > > > folio at some point and then tripping over it later..? I can kill the
> > > > fsstress process but then the umount ultimately gets stuck tossing
> > > > pagecache [1], so the mount still ends up stuck indefinitely. Anyways,
> > > > I'll poke at it some more but I figure you might be able to make sense
> > > > of this faster than I can.
> > > >
> > > > Brian
> > >
> > > Hi Brian,
> > >
> > > Thanks for your report and the repro instructions. I will look into
> > > this and report back what I find.
> >
> > This is the fix:
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 4e6258fdb915..aa46fec8362d 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -445,6 +445,9 @@ static void iomap_read_end(struct folio *folio,
> > size_t bytes_pending)
> > bool end_read, uptodate;
> > size_t bytes_accounted = folio_size(folio) - bytes_pending;
> >
> > + if (!bytes_accounted)
> > + return;
> > +
> > spin_lock_irq(&ifs->state_lock);
> >
> >
> > What I missed was that if all the bytes in the folio are non-uptodate
> > and need to be read in by the filesystem, then there's a bug where the
> > read will be ended on the folio twice (in iomap_read_end() and when
> > the filesystem calls iomap_finish_folio_read(), when only the
> > filesystem should end the read), which does two folio unlocks and ends
> > up leaving the folio locked. Looking at the writeback patch that does
> > a similar optimization [1], I missed the same thing there.
> >
>
> Makes sense.. though a short comment wouldn't hurt in there. ;) I found
> myself a little confused by the accounted vs. pending naming when
> reading through that code. If I follow correctly, the intent is to refer
> to the additional bytes accounted to read_bytes_pending via the init
> (where it just accounts the whole folio up front) and pending refers to
> submitted I/O.
>
> Presumably that extra accounting doubly serves as the typical "don't
> complete the op before the submitter is done processing" extra
> reference, except in this full submit case of course. If so, that's
> subtle enough in my mind that a sentence or two on it wouldn't hurt..
I will add a comment about this :) That's a good point about the
naming, maybe "bytes_submitted" and "bytes_unsubmitted" are a lot less
confusing than "bytes_pending" and "bytes_accounted".
Thanks,
Joanne
>
> > I'll fix up both. Thanks for catching this and bisecting it down to
> > this patch. Sorry for the trouble.
> >
>
> No prob. Thanks for the fix!
>
> Brian
>
> > Thanks,
> > Joanne
> >
> > [1] https://lore.kernel.org/linux-fsdevel/20251009225611.3744728-4-joannelkoong@gmail.com/
> > >
> > > Thanks,
> > > Joanne
> > > >
> >
>
^ permalink raw reply [flat|nested] 28+ messages in thread
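A sketch of iomap_read_end() with the fix above applied may help make
the accounting concrete (abridged; error paths and the no-ifs case are
omitted, and the bookkeeping around read_bytes_pending follows the
patch but is simplified):

	static void iomap_read_end(struct folio *folio, size_t bytes_pending)
	{
		struct iomap_folio_state *ifs = folio->private;
		/*
		 * The slice never handed to the filesystem: already-uptodate
		 * or zeroed ranges, plus the submitter's completion hold.
		 */
		size_t bytes_accounted = folio_size(folio) - bytes_pending;
		bool end_read, uptodate = false;

		/*
		 * If every byte was submitted for I/O there is nothing left
		 * for the submitter to account; only the completion path may
		 * end the read, otherwise the folio gets unlocked twice.
		 */
		if (!bytes_accounted)
			return;

		spin_lock_irq(&ifs->state_lock);
		ifs->read_bytes_pending -= bytes_accounted;
		end_read = !ifs->read_bytes_pending;
		if (end_read)
			uptodate = ifs_is_fully_uptodate(folio, ifs);
		spin_unlock_irq(&ifs->state_lock);

		if (end_read)
			folio_end_read(folio, uptodate);
	}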
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-10-24 19:48 ` Joanne Koong
@ 2025-10-24 21:55 ` Joanne Koong
2025-10-27 12:16 ` Brian Foster
0 siblings, 1 reply; 28+ messages in thread
From: Joanne Koong @ 2025-10-24 21:55 UTC (permalink / raw)
To: Brian Foster
Cc: brauner, miklos, djwong, hch, hsiangkao, linux-block, gfs2,
linux-fsdevel, kernel-team, linux-xfs, linux-doc
On Fri, Oct 24, 2025 at 12:48 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Fri, Oct 24, 2025 at 10:10 AM Brian Foster <bfoster@redhat.com> wrote:
> >
> > On Fri, Oct 24, 2025 at 09:25:13AM -0700, Joanne Koong wrote:
> > > On Thu, Oct 23, 2025 at 5:01 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > On Thu, Oct 23, 2025 at 12:30 PM Brian Foster <bfoster@redhat.com> wrote:
> > > > >
> > > > > On Thu, Sep 25, 2025 at 05:26:02PM -0700, Joanne Koong wrote:
> > > > > > Instead of incrementing read_bytes_pending for every folio range read in
> > > > > > (which requires acquiring the spinlock to do so), set read_bytes_pending
> > > > > > to the folio size when the first range is asynchronously read in, keep
> > > > > > track of how many bytes total are asynchronously read in, and adjust
> > > > > > read_bytes_pending accordingly after issuing requests to read in all the
> > > > > > necessary ranges.
> > > > > >
> > > > > > iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
> > > > > > value for pending bytes necessarily indicates the folio is in the bio.
> > > > > >
> > > > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > > > Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
> > > > > > ---
> > > > >
> > > > > Hi Joanne,
> > > > >
> > > > > I was throwing some extra testing at the vfs-6.19.iomap branch since the
> > > > > little merge conflict thing with iomap_iter_advance(). I end up hitting
> > > > > what appears to be a lockup on XFS with 1k FSB (-bsize=1k) running
> > > > > generic/051. It reproduces fairly reliably within a few iterations or so
> > > > > and seems to always stall during a read for a dedupe operation:
> > > > >
> > > > > task:fsstress state:D stack:0 pid:12094 tgid:12094 ppid:12091 task_flags:0x400140 flags:0x00080003
> > > > > Call Trace:
> > > > > <TASK>
> > > > > __schedule+0x2fc/0x7a0
> > > > > schedule+0x27/0x80
> > > > > io_schedule+0x46/0x70
> > > > > folio_wait_bit_common+0x12b/0x310
> > > > > ? __pfx_wake_page_function+0x10/0x10
> > > > > ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
> > > > > filemap_read_folio+0x85/0xd0
> > > > > ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
> > > > > do_read_cache_folio+0x7c/0x1b0
> > > > > vfs_dedupe_file_range_compare.constprop.0+0xaf/0x2d0
> > > > > __generic_remap_file_range_prep+0x276/0x2a0
> > > > > generic_remap_file_range_prep+0x10/0x20
> > > > > xfs_reflink_remap_prep+0x22c/0x300 [xfs]
> > > > > xfs_file_remap_range+0x84/0x360 [xfs]
> > > > > vfs_dedupe_file_range_one+0x1b2/0x1d0
> > > > > ? remap_verify_area+0x46/0x140
> > > > > vfs_dedupe_file_range+0x162/0x220
> > > > > do_vfs_ioctl+0x4d1/0x940
> > > > > __x64_sys_ioctl+0x75/0xe0
> > > > > do_syscall_64+0x84/0x800
> > > > > ? do_syscall_64+0xbb/0x800
> > > > > ? avc_has_perm_noaudit+0x6b/0xf0
> > > > > ? _copy_to_user+0x31/0x40
> > > > > ? cp_new_stat+0x130/0x170
> > > > > ? __do_sys_newfstat+0x44/0x70
> > > > > ? do_syscall_64+0xbb/0x800
> > > > > ? do_syscall_64+0xbb/0x800
> > > > > ? clear_bhb_loop+0x30/0x80
> > > > > ? clear_bhb_loop+0x30/0x80
> > > > > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > > > RIP: 0033:0x7fe6bbd9a14d
> > > > > RSP: 002b:00007ffde72cd4e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > > > > RAX: ffffffffffffffda RBX: 0000000000000068 RCX: 00007fe6bbd9a14d
> > > > > RDX: 000000000a1394b0 RSI: 00000000c0189436 RDI: 0000000000000004
> > > > > RBP: 00007ffde72cd530 R08: 0000000000001000 R09: 000000000a11a3fc
> > > > > R10: 000000000001d6c0 R11: 0000000000000246 R12: 000000000a12cfb0
> > > > > R13: 000000000a12ba10 R14: 000000000a14e610 R15: 0000000000019000
> > > > > </TASK>
> > > > >
> > > > > It wasn't immediately clear to me what the issue was so I bisected and
> > > > > it landed on this patch. It kind of looks like we're failing to unlock a
> > > > > folio at some point and then tripping over it later..? I can kill the
> > > > > fsstress process but then the umount ultimately gets stuck tossing
> > > > > pagecache [1], so the mount still ends up stuck indefinitely. Anyways,
> > > > > I'll poke at it some more but I figure you might be able to make sense
> > > > > of this faster than I can.
> > > > >
> > > > > Brian
> > > >
> > > > Hi Brian,
> > > >
> > > > Thanks for your report and the repro instructions. I will look into
> > > > this and report back what I find.
> > >
> > > This is the fix:
> > >
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index 4e6258fdb915..aa46fec8362d 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -445,6 +445,9 @@ static void iomap_read_end(struct folio *folio,
> > > size_t bytes_pending)
> > > bool end_read, uptodate;
> > > size_t bytes_accounted = folio_size(folio) - bytes_pending;
> > >
> > > + if (!bytes_accounted)
> > > + return;
> > > +
> > > spin_lock_irq(&ifs->state_lock);
> > >
> > >
> > > What I missed was that if all the bytes in the folio are non-uptodate
> > > and need to be read in by the filesystem, then there's a bug where the
> > > read will be ended on the folio twice (in iomap_read_end() and when
> > > the filesystem calls iomap_finish_folio_read(), when only the
> > > filesystem should end the read), which does two folio unlocks and ends
> > > up leaving the folio locked. Looking at the writeback patch that does
> > > a similar optimization [1], I missed the same thing there.
> > >
> >
> > Makes sense.. though a short comment wouldn't hurt in there. ;) I found
> > myself a little confused by the accounted vs. pending naming when
> > reading through that code. If I follow correctly, the intent is to refer
> > to the additional bytes accounted to read_bytes_pending via the init
> > (where it just accounts the whole folio up front) and pending refers to
> > submitted I/O.
> >
> > Presumably that extra accounting doubly serves as the typical "don't
> > complete the op before the submitter is done processing" extra
> > reference, except in this full submit case of course. If so, that's
> > subtle enough in my mind that a sentence or two on it wouldn't hurt..
>
> I will add a comment about this :) That's a good point about the
> naming, maybe "bytes_submitted" and "bytes_unsubmitted" are a lot less
> confusing than "bytes_pending" and "bytes_accounted".
Thinking about this some more, bytes_unsubmitted sounds even more
confusing, so maybe bytes_nonsubmitted or bytes_not_submitted. I'll
think about this some more but have kept it as pending/accounted for now.
The fix for this bug is here [1].
Thanks,
Joanne
[1] https://lore.kernel.org/linux-fsdevel/20251024215008.3844068-1-joannelkoong@gmail.com/
>
> Thanks,
> Joanne
>
> >
> > > I'll fix up both. Thanks for catching this and bisecting it down to
> > > this patch. Sorry for the trouble.
> > >
> >
> > No prob. Thanks for the fix!
> >
> > Brian
> >
> > > Thanks,
> > > Joanne
> > >
> > > [1] https://lore.kernel.org/linux-fsdevel/20251009225611.3744728-4-joannelkoong@gmail.com/
> > > >
> > > > Thanks,
> > > > Joanne
> > > > >
> > >
> >
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-10-24 21:55 ` Joanne Koong
@ 2025-10-27 12:16 ` Brian Foster
0 siblings, 0 replies; 28+ messages in thread
From: Brian Foster @ 2025-10-27 12:16 UTC (permalink / raw)
To: Joanne Koong
Cc: brauner, miklos, djwong, hch, hsiangkao, linux-block, gfs2,
linux-fsdevel, kernel-team, linux-xfs, linux-doc
On Fri, Oct 24, 2025 at 02:55:20PM -0700, Joanne Koong wrote:
> On Fri, Oct 24, 2025 at 12:48 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Fri, Oct 24, 2025 at 10:10 AM Brian Foster <bfoster@redhat.com> wrote:
> > >
> > > On Fri, Oct 24, 2025 at 09:25:13AM -0700, Joanne Koong wrote:
> > > > On Thu, Oct 23, 2025 at 5:01 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > >
> > > > > On Thu, Oct 23, 2025 at 12:30 PM Brian Foster <bfoster@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, Sep 25, 2025 at 05:26:02PM -0700, Joanne Koong wrote:
> > > > > > > Instead of incrementing read_bytes_pending for every folio range read in
> > > > > > > (which requires acquiring the spinlock to do so), set read_bytes_pending
> > > > > > > to the folio size when the first range is asynchronously read in, keep
> > > > > > > track of how many bytes total are asynchronously read in, and adjust
> > > > > > > read_bytes_pending accordingly after issuing requests to read in all the
> > > > > > > necessary ranges.
> > > > > > >
> > > > > > > iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
> > > > > > > value for pending bytes necessarily indicates the folio is in the bio.
> > > > > > >
> > > > > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > > > > Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
> > > > > > > ---
> > > > > >
> > > > > > Hi Joanne,
> > > > > >
> > > > > > I was throwing some extra testing at the vfs-6.19.iomap branch since the
> > > > > > little merge conflict thing with iomap_iter_advance(). I end up hitting
> > > > > > what appears to be a lockup on XFS with 1k FSB (-bsize=1k) running
> > > > > > generic/051. It reproduces fairly reliably within a few iterations or so
> > > > > > and seems to always stall during a read for a dedupe operation:
> > > > > >
> > > > > > task:fsstress state:D stack:0 pid:12094 tgid:12094 ppid:12091 task_flags:0x400140 flags:0x00080003
> > > > > > Call Trace:
> > > > > > <TASK>
> > > > > > __schedule+0x2fc/0x7a0
> > > > > > schedule+0x27/0x80
> > > > > > io_schedule+0x46/0x70
> > > > > > folio_wait_bit_common+0x12b/0x310
> > > > > > ? __pfx_wake_page_function+0x10/0x10
> > > > > > ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
> > > > > > filemap_read_folio+0x85/0xd0
> > > > > > ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
> > > > > > do_read_cache_folio+0x7c/0x1b0
> > > > > > vfs_dedupe_file_range_compare.constprop.0+0xaf/0x2d0
> > > > > > __generic_remap_file_range_prep+0x276/0x2a0
> > > > > > generic_remap_file_range_prep+0x10/0x20
> > > > > > xfs_reflink_remap_prep+0x22c/0x300 [xfs]
> > > > > > xfs_file_remap_range+0x84/0x360 [xfs]
> > > > > > vfs_dedupe_file_range_one+0x1b2/0x1d0
> > > > > > ? remap_verify_area+0x46/0x140
> > > > > > vfs_dedupe_file_range+0x162/0x220
> > > > > > do_vfs_ioctl+0x4d1/0x940
> > > > > > __x64_sys_ioctl+0x75/0xe0
> > > > > > do_syscall_64+0x84/0x800
> > > > > > ? do_syscall_64+0xbb/0x800
> > > > > > ? avc_has_perm_noaudit+0x6b/0xf0
> > > > > > ? _copy_to_user+0x31/0x40
> > > > > > ? cp_new_stat+0x130/0x170
> > > > > > ? __do_sys_newfstat+0x44/0x70
> > > > > > ? do_syscall_64+0xbb/0x800
> > > > > > ? do_syscall_64+0xbb/0x800
> > > > > > ? clear_bhb_loop+0x30/0x80
> > > > > > ? clear_bhb_loop+0x30/0x80
> > > > > > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > > > > RIP: 0033:0x7fe6bbd9a14d
> > > > > > RSP: 002b:00007ffde72cd4e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > > > > > RAX: ffffffffffffffda RBX: 0000000000000068 RCX: 00007fe6bbd9a14d
> > > > > > RDX: 000000000a1394b0 RSI: 00000000c0189436 RDI: 0000000000000004
> > > > > > RBP: 00007ffde72cd530 R08: 0000000000001000 R09: 000000000a11a3fc
> > > > > > R10: 000000000001d6c0 R11: 0000000000000246 R12: 000000000a12cfb0
> > > > > > R13: 000000000a12ba10 R14: 000000000a14e610 R15: 0000000000019000
> > > > > > </TASK>
> > > > > >
> > > > > > It wasn't immediately clear to me what the issue was so I bisected and
> > > > > > it landed on this patch. It kind of looks like we're failing to unlock a
> > > > > > folio at some point and then tripping over it later..? I can kill the
> > > > > > fsstress process but then the umount ultimately gets stuck tossing
> > > > > > pagecache [1], so the mount still ends up stuck indefinitely. Anyways,
> > > > > > I'll poke at it some more but I figure you might be able to make sense
> > > > > > of this faster than I can.
> > > > > >
> > > > > > Brian
> > > > >
> > > > > Hi Brian,
> > > > >
> > > > > Thanks for your report and the repro instructions. I will look into
> > > > > this and report back what I find.
> > > >
> > > > This is the fix:
> > > >
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index 4e6258fdb915..aa46fec8362d 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -445,6 +445,9 @@ static void iomap_read_end(struct folio *folio,
> > > > size_t bytes_pending)
> > > > bool end_read, uptodate;
> > > > size_t bytes_accounted = folio_size(folio) - bytes_pending;
> > > >
> > > > + if (!bytes_accounted)
> > > > + return;
> > > > +
> > > > spin_lock_irq(&ifs->state_lock);
> > > >
> > > >
> > > > What I missed was that if all the bytes in the folio are non-uptodate
> > > > and need to be read in by the filesystem, then there's a bug where the
> > > > read will be ended on the folio twice (in iomap_read_end() and when
> > > > the filesystem calls iomap_finish_folio_read(), when only the
> > > > filesystem should end the read), which does two folio unlocks and ends
> > > > up leaving the folio locked. Looking at the writeback patch that does
> > > > a similar optimization [1], I missed the same thing there.
> > > >
> > >
> > > Makes sense.. though a short comment wouldn't hurt in there. ;) I found
> > > myself a little confused by the accounted vs. pending naming when
> > > reading through that code. If I follow correctly, the intent is to refer
> > > to the additional bytes accounted to read_bytes_pending via the init
> > > (where it just accounts the whole folio up front) and pending refers to
> > > submitted I/O.
> > >
> > > Presumably that extra accounting doubly serves as the typical "don't
> > > complete the op before the submitter is done processing" extra
> > > reference, except in this full submit case of course. If so, that's
> > > subtle enough in my mind that a sentence or two on it wouldn't hurt..
> >
> > I will add a comment about this :) That's a good point about the
> > naming, maybe "bytes_submitted" and "bytes_unsubmitted" are a lot less
> > confusing than "bytes_pending" and "bytes_accounted".
>
> Thinking about this some more, bytes_unsubmitted sounds even more
> confusing, so maybe bytes_nonsubmitted or bytes_not_submitted. I'll
> think about this some more but have kept it as pending/accounted for now.
>
bytes_submitted sounds better than pending to me, not sure about
unsubmitted or whatever. As long as there's a sentence or two that
explains what accounted means in the end helper, though, that seems
reasonable enough to me.
Brian
> The fix for this bug is here [1].
>
> Thanks,
> Joanne
>
> [1] https://lore.kernel.org/linux-fsdevel/20251024215008.3844068-1-joannelkoong@gmail.com/
>
> >
> > Thanks,
> > Joanne
> >
> > >
> > > > I'll fix up both. Thanks for catching this and bisecting it down to
> > > > this patch. Sorry for the trouble.
> > > >
> > >
> > > No prob. Thanks for the fix!
> > >
> > > Brian
> > >
> > > > Thanks,
> > > > Joanne
> > > >
> > > > [1] https://lore.kernel.org/linux-fsdevel/20251009225611.3744728-4-joannelkoong@gmail.com/
> > > > >
> > > > > Thanks,
> > > > > Joanne
> > > > > >
> > > >
> > >
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-10-24 16:25 ` Joanne Koong
2025-10-24 17:14 ` Brian Foster
@ 2025-10-24 17:21 ` Matthew Wilcox
2025-10-24 19:22 ` Joanne Koong
1 sibling, 1 reply; 28+ messages in thread
From: Matthew Wilcox @ 2025-10-24 17:21 UTC (permalink / raw)
To: Joanne Koong
Cc: Brian Foster, brauner, miklos, djwong, hch, hsiangkao,
linux-block, gfs2, linux-fsdevel, kernel-team, linux-xfs,
linux-doc
On Fri, Oct 24, 2025 at 09:25:13AM -0700, Joanne Koong wrote:
> What I missed was that if all the bytes in the folio are non-uptodate
> and need to be read in by the filesystem, then there's a bug where the
> read will be ended on the folio twice (in iomap_read_end() and when
> the filesystem calls iomap_finish_folio_read(), when only the
> filesystem should end the read), which does two folio unlocks and ends
> up leaving the folio locked. Looking at the writeback patch that does
> a similar optimization [1], I missed the same thing there.
folio_unlock() contains:
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
Feels like more filesystem people should be enabling CONFIG_DEBUG_VM
when testing (excluding performance testing of course; it'll do ugly
things to your performance numbers).
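A minimal illustration of the failure mode (hypothetical call
sequence, not code from the series):

	folio_lock(folio);
	folio_unlock(folio);	/* fine: the folio was locked */

	/*
	 * With CONFIG_DEBUG_VM=y this second, erroneous unlock trips
	 * VM_BUG_ON_FOLIO() right at the buggy call site. Without it,
	 * the PG_locked toggle silently leaves the folio locked again
	 * and a later reader hangs in folio_wait_bit_common(), as in
	 * the trace earlier in this thread.
	 */
	folio_unlock(folio);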
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-10-24 17:21 ` Matthew Wilcox
@ 2025-10-24 19:22 ` Joanne Koong
2025-10-24 20:59 ` Matthew Wilcox
0 siblings, 1 reply; 28+ messages in thread
From: Joanne Koong @ 2025-10-24 19:22 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Brian Foster, brauner, miklos, djwong, hch, hsiangkao,
linux-block, gfs2, linux-fsdevel, kernel-team, linux-xfs,
linux-doc
On Fri, Oct 24, 2025 at 10:21 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Oct 24, 2025 at 09:25:13AM -0700, Joanne Koong wrote:
> > What I missed was that if all the bytes in the folio are non-uptodate
> > and need to be read in by the filesystem, then there's a bug where the
> > read will be ended on the folio twice (in iomap_read_end() and when
> > the filesystem calls iomap_finish_folio_read(), when only the
> > filesystem should end the read), which does two folio unlocks and ends
> > up leaving the folio locked. Looking at the writeback patch that does
> > a similar optimization [1], I missed the same thing there.
>
> folio_unlock() contains:
> VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>
> Feels like more filesystem people should be enabling CONFIG_DEBUG_VM
> when testing (excluding performance testing of course; it'll do ugly
> things to your performance numbers).
Point taken. It looks like there's a bunch of other memory debugging
configs as well. Do you recommend enabling all of these when testing?
Do you have a particular .config you use when you run tests?
Thanks,
Joanne
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-10-24 19:22 ` Joanne Koong
@ 2025-10-24 20:59 ` Matthew Wilcox
2025-10-24 21:37 ` Darrick J. Wong
2025-10-24 21:58 ` Joanne Koong
0 siblings, 2 replies; 28+ messages in thread
From: Matthew Wilcox @ 2025-10-24 20:59 UTC (permalink / raw)
To: Joanne Koong
Cc: Brian Foster, brauner, miklos, djwong, hch, hsiangkao,
linux-block, gfs2, linux-fsdevel, kernel-team, linux-xfs,
linux-doc
On Fri, Oct 24, 2025 at 12:22:32PM -0700, Joanne Koong wrote:
> > Feels like more filesystem people should be enabling CONFIG_DEBUG_VM
> > when testing (excluding performance testing of course; it'll do ugly
> > things to your performance numbers).
>
> Point taken. It looks like there's a bunch of other memory debugging
> configs as well. Do you recommend enabling all of these when testing?
> Do you have a particular .config you use when you run tests?
Our Kconfig is far too ornate. We could do with a "recommended for
kernel developers" profile. Here's what I'm currently using, though I
know it's changed over time:
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_PM_DEBUG=y
CONFIG_PM_SLEEP_DEBUG=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_BLK_DEBUG_FS=y
CONFIG_PNP_DEBUG_MESSAGES=y
CONFIG_SCSI_DEBUG=m
CONFIG_EXT4_DEBUG=y
CONFIG_JFS_DEBUG=y
CONFIG_XFS_DEBUG=y
CONFIG_BTRFS_DEBUG=y
CONFIG_UFS_DEBUG=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_MISC=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_INFO_DWARF4=y
CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_FS_ALLOW_ALL=y
CONFIG_ARCH_HAS_EARLY_DEBUG=y
CONFIG_SLUB_DEBUG=y
CONFIG_ARCH_HAS_DEBUG_WX=y
CONFIG_HAVE_DEBUG_KMEMLEAK=y
CONFIG_SHRINKER_DEBUG=y
CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
CONFIG_DEBUG_VM_IRQSOFF=y
CONFIG_DEBUG_VM=y
CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_DEBUG_LIST=y
CONFIG_X86_DEBUG_FPU=y
CONFIG_FAULT_INJECTION_DEBUG_FS=y
(output from grep DEBUG .build/.config |grep -v ^#)
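For anyone wanting to fold these into an existing build, something
like the following should work from a kernel source tree
(scripts/config is the in-tree helper; the option list here is
abridged):

	$ ./scripts/config --file .config -e DEBUG_VM -e DEBUG_LIST \
		-e DEBUG_SPINLOCK -e DEBUG_LOCK_ALLOC -e SLUB_DEBUG
	$ make olddefconfig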
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-10-24 20:59 ` Matthew Wilcox
@ 2025-10-24 21:37 ` Darrick J. Wong
2025-10-24 21:58 ` Joanne Koong
1 sibling, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2025-10-24 21:37 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Joanne Koong, Brian Foster, brauner, miklos, hch, hsiangkao,
linux-block, gfs2, linux-fsdevel, kernel-team, linux-xfs,
linux-doc
On Fri, Oct 24, 2025 at 09:59:01PM +0100, Matthew Wilcox wrote:
> On Fri, Oct 24, 2025 at 12:22:32PM -0700, Joanne Koong wrote:
> > > Feels like more filesystem people should be enabling CONFIG_DEBUG_VM
> > > when testing (excluding performance testing of course; it'll do ugly
> > > things to your performance numbers).
> >
> > Point taken. It looks like there's a bunch of other memory debugging
> > configs as well. Do you recommend enabling all of these when testing?
> > Do you have a particular .config you use when you run tests?
>
> Our Kconfig is far too ornate. We could do with a "recommended for
> kernel developers" profile. Here's what I'm currently using, though I
> know it's changed over time:
Is there any chance you could split the VM debug checks into cheap and
expensive ones, and create another kconfig option so that we could do
the cheap checks without having fstests take a lot longer?
You could also implement shenanigans like the following to enable the
expensive checks at runtime:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=b739fff870384fd239abfd99ecee6bc47640794d
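Roughly along these lines (a sketch only, not the linked patch; the
key, macro, and function names here are made up):

	DEFINE_STATIC_KEY_FALSE(vm_debug_expensive);

	#define VM_WARN_ON_EXPENSIVE(cond)				\
	do {								\
		if (static_branch_unlikely(&vm_debug_expensive))	\
			VM_WARN_ON(cond);				\
	} while (0)

	static void vm_debug_enable_expensive(void)
	{
		/* e.g. called from a debugfs or sysfs write handler */
		static_branch_enable(&vm_debug_expensive);
	}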
(Yeah, I know, this is probably a 2 year project + bikeshed score of at
least 30...)
--D
> CONFIG_X86_DEBUGCTLMSR=y
> CONFIG_PM_DEBUG=y
> CONFIG_PM_SLEEP_DEBUG=y
> CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
> CONFIG_BLK_DEBUG_FS=y
> CONFIG_PNP_DEBUG_MESSAGES=y
> CONFIG_SCSI_DEBUG=m
> CONFIG_EXT4_DEBUG=y
> CONFIG_JFS_DEBUG=y
> CONFIG_XFS_DEBUG=y
> CONFIG_BTRFS_DEBUG=y
> CONFIG_UFS_DEBUG=y
> CONFIG_DEBUG_BUGVERBOSE=y
> CONFIG_DEBUG_KERNEL=y
> CONFIG_DEBUG_MISC=y
> CONFIG_DEBUG_INFO=y
> CONFIG_DEBUG_INFO_DWARF4=y
> CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
> CONFIG_DEBUG_FS=y
> CONFIG_DEBUG_FS_ALLOW_ALL=y
> CONFIG_ARCH_HAS_EARLY_DEBUG=y
> CONFIG_SLUB_DEBUG=y
> CONFIG_ARCH_HAS_DEBUG_WX=y
> CONFIG_HAVE_DEBUG_KMEMLEAK=y
> CONFIG_SHRINKER_DEBUG=y
> CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
> CONFIG_DEBUG_VM_IRQSOFF=y
> CONFIG_DEBUG_VM=y
> CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
> CONFIG_DEBUG_MEMORY_INIT=y
> CONFIG_LOCK_DEBUGGING_SUPPORT=y
> CONFIG_DEBUG_RT_MUTEXES=y
> CONFIG_DEBUG_SPINLOCK=y
> CONFIG_DEBUG_MUTEXES=y
> CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
> CONFIG_DEBUG_RWSEMS=y
> CONFIG_DEBUG_LOCK_ALLOC=y
> CONFIG_DEBUG_LIST=y
> CONFIG_X86_DEBUG_FPU=y
> CONFIG_FAULT_INJECTION_DEBUG_FS=y
>
> (output from grep DEBUG .build/.config |grep -v ^#)
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally
2025-10-24 20:59 ` Matthew Wilcox
2025-10-24 21:37 ` Darrick J. Wong
@ 2025-10-24 21:58 ` Joanne Koong
1 sibling, 0 replies; 28+ messages in thread
From: Joanne Koong @ 2025-10-24 21:58 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Brian Foster, brauner, miklos, djwong, hch, hsiangkao,
linux-block, gfs2, linux-fsdevel, kernel-team, linux-xfs,
linux-doc
On Fri, Oct 24, 2025 at 1:59 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Oct 24, 2025 at 12:22:32PM -0700, Joanne Koong wrote:
> > > Feels like more filesystem people should be enabling CONFIG_DEBUG_VM
> > > when testing (excluding performance testing of course; it'll do ugly
> > > things to your performance numbers).
> >
> > Point taken. It looks like there's a bunch of other memory debugging
> > configs as well. Do you recommend enabling all of these when testing?
> > Do you have a particular .config you use when you run tests?
>
> Our Kconfig is far too ornate. We could do with a "recommended for
> kernel developers" profile. Here's what I'm currently using, though I
> know it's changed over time:
>
> CONFIG_X86_DEBUGCTLMSR=y
> CONFIG_PM_DEBUG=y
> CONFIG_PM_SLEEP_DEBUG=y
> CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
> CONFIG_BLK_DEBUG_FS=y
> CONFIG_PNP_DEBUG_MESSAGES=y
> CONFIG_SCSI_DEBUG=m
> CONFIG_EXT4_DEBUG=y
> CONFIG_JFS_DEBUG=y
> CONFIG_XFS_DEBUG=y
> CONFIG_BTRFS_DEBUG=y
> CONFIG_UFS_DEBUG=y
> CONFIG_DEBUG_BUGVERBOSE=y
> CONFIG_DEBUG_KERNEL=y
> CONFIG_DEBUG_MISC=y
> CONFIG_DEBUG_INFO=y
> CONFIG_DEBUG_INFO_DWARF4=y
> CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
> CONFIG_DEBUG_FS=y
> CONFIG_DEBUG_FS_ALLOW_ALL=y
> CONFIG_ARCH_HAS_EARLY_DEBUG=y
> CONFIG_SLUB_DEBUG=y
> CONFIG_ARCH_HAS_DEBUG_WX=y
> CONFIG_HAVE_DEBUG_KMEMLEAK=y
> CONFIG_SHRINKER_DEBUG=y
> CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
> CONFIG_DEBUG_VM_IRQSOFF=y
> CONFIG_DEBUG_VM=y
> CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
> CONFIG_DEBUG_MEMORY_INIT=y
> CONFIG_LOCK_DEBUGGING_SUPPORT=y
> CONFIG_DEBUG_RT_MUTEXES=y
> CONFIG_DEBUG_SPINLOCK=y
> CONFIG_DEBUG_MUTEXES=y
> CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
> CONFIG_DEBUG_RWSEMS=y
> CONFIG_DEBUG_LOCK_ALLOC=y
> CONFIG_DEBUG_LIST=y
> CONFIG_X86_DEBUG_FPU=y
> CONFIG_FAULT_INJECTION_DEBUG_FS=y
>
> (output from grep DEBUG .build/.config |grep -v ^#)
Thank you, I'll copy this.
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v5 08/14] iomap: set accurate iter->pos when reading folio ranges
2025-09-26 0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
` (6 preceding siblings ...)
2025-09-26 0:26 ` [PATCH v5 07/14] iomap: track pending read bytes more optimally Joanne Koong
@ 2025-09-26 0:26 ` Joanne Koong
2025-09-26 0:26 ` [PATCH v5 09/14] iomap: add caller-provided callbacks for read and readahead Joanne Koong
` (6 subsequent siblings)
14 siblings, 0 replies; 28+ messages in thread
From: Joanne Koong @ 2025-09-26 0:26 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc, Christoph Hellwig
Advance iter to the correct position before calling an IO helper to read
in a folio range. This allows the helper to reliably use iter->pos to
determine the starting offset for reading.
This will simplify the interface for reading in folio ranges when iomap
read/readahead supports caller-provided callbacks.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
---
fs/iomap/buffered-io.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4e6258fdb915..82bdf7c5e03c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -375,10 +375,11 @@ static void iomap_bio_submit_read(struct iomap_read_folio_ctx *ctx)
}
static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
- struct iomap_read_folio_ctx *ctx, loff_t pos, size_t plen)
+ struct iomap_read_folio_ctx *ctx, size_t plen)
{
struct folio *folio = ctx->cur_folio;
const struct iomap *iomap = &iter->iomap;
+ loff_t pos = iter->pos;
size_t poff = offset_in_folio(folio, pos);
loff_t length = iomap_length(iter);
sector_t sector;
@@ -470,7 +471,7 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
loff_t length = iomap_length(iter);
struct folio *folio = ctx->cur_folio;
size_t poff, plen;
- loff_t count;
+ loff_t pos_diff;
int ret;
if (iomap->type == IOMAP_INLINE) {
@@ -488,12 +489,16 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff,
&plen);
- count = pos - iter->pos + plen;
- if (WARN_ON_ONCE(count > length))
+ pos_diff = pos - iter->pos;
+ if (WARN_ON_ONCE(pos_diff + plen > length))
return -EIO;
+ ret = iomap_iter_advance(iter, pos_diff);
+ if (ret)
+ return ret;
+
if (plen == 0)
- return iomap_iter_advance(iter, count);
+ return 0;
/* zero post-eof blocks as the page may be mapped */
if (iomap_block_needs_zeroing(iter, pos)) {
@@ -503,13 +508,13 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
if (!*bytes_pending)
iomap_read_init(folio);
*bytes_pending += plen;
- iomap_bio_read_folio_range(iter, ctx, pos, plen);
+ iomap_bio_read_folio_range(iter, ctx, plen);
}
- ret = iomap_iter_advance(iter, count);
+ ret = iomap_iter_advance(iter, plen);
if (ret)
return ret;
- length -= count;
+ length -= pos_diff + plen;
pos = iter->pos;
}
return 0;
--
2.47.3
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v5 09/14] iomap: add caller-provided callbacks for read and readahead
2025-09-26 0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
` (7 preceding siblings ...)
2025-09-26 0:26 ` [PATCH v5 08/14] iomap: set accurate iter->pos when reading folio ranges Joanne Koong
@ 2025-09-26 0:26 ` Joanne Koong
2025-09-26 0:26 ` [PATCH v5 10/14] iomap: move buffered io bio logic into new file Joanne Koong
` (5 subsequent siblings)
14 siblings, 0 replies; 28+ messages in thread
From: Joanne Koong @ 2025-09-26 0:26 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
Add caller-provided callbacks for read and readahead so that iomap read
and readahead can be used generically, especially by filesystems that
are not block-based.
In particular, this:
* Modifies the read and readahead interface to take in a
struct iomap_read_folio_ctx that is publicly defined as:
struct iomap_read_folio_ctx {
const struct iomap_read_ops *ops;
struct folio *cur_folio;
struct readahead_control *rac;
void *read_ctx;
};
where struct iomap_read_ops is defined as:
struct iomap_read_ops {
int (*read_folio_range)(const struct iomap_iter *iter,
struct iomap_read_folio_ctx *ctx,
size_t len);
void (*submit_read)(struct iomap_read_folio_ctx *ctx);
};
read_folio_range() reads in the folio range and must be provided by
the caller. submit_read() is optional and is used for submitting any
pending read requests.
* Modifies existing filesystems that use iomap for read and readahead to
use the new API, through the new statically inlined helpers
iomap_bio_read_folio() and iomap_bio_readahead(). There is no change
in functionality for those filesystems.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
.../filesystems/iomap/operations.rst | 44 +++++++++++++
block/fops.c | 5 +-
fs/erofs/data.c | 5 +-
fs/gfs2/aops.c | 6 +-
fs/iomap/buffered-io.c | 55 ++++++++--------
fs/xfs/xfs_aops.c | 5 +-
fs/zonefs/file.c | 5 +-
include/linux/iomap.h | 63 ++++++++++++++++++-
8 files changed, 149 insertions(+), 39 deletions(-)
diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
index 067ed8e14ef3..cef3c3e76e9e 100644
--- a/Documentation/filesystems/iomap/operations.rst
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -135,6 +135,28 @@ These ``struct kiocb`` flags are significant for buffered I/O with iomap:
* ``IOCB_DONTCACHE``: Turns on ``IOMAP_DONTCACHE``.
+``struct iomap_read_ops``
+--------------------------
+
+.. code-block:: c
+
+ struct iomap_read_ops {
+ int (*read_folio_range)(const struct iomap_iter *iter,
+ struct iomap_read_folio_ctx *ctx, size_t len);
+ void (*submit_read)(struct iomap_read_folio_ctx *ctx);
+ };
+
+iomap calls these functions:
+
+ - ``read_folio_range``: Called to read in the range. This must be provided
+ by the caller. The caller is responsible for calling
+ iomap_finish_folio_read() after reading in the folio range. This should be
+ done even if an error is encountered during the read. This returns 0 on
+ success or a negative error on failure.
+
+ - ``submit_read``: Submit any pending read requests. This function is
+ optional.
+
Internal per-Folio State
------------------------
@@ -182,6 +204,28 @@ The ``flags`` argument to ``->iomap_begin`` will be set to zero.
The pagecache takes whatever locks it needs before calling the
filesystem.
+Both ``iomap_readahead`` and ``iomap_read_folio`` pass in a ``struct
+iomap_read_folio_ctx``:
+
+.. code-block:: c
+
+ struct iomap_read_folio_ctx {
+ const struct iomap_read_ops *ops;
+ struct folio *cur_folio;
+ struct readahead_control *rac;
+ void *read_ctx;
+ };
+
+``iomap_readahead`` must set:
+ * ``ops->read_folio_range()`` and ``rac``
+
+``iomap_read_folio`` must set:
+ * ``ops->read_folio_range()`` and ``cur_folio``
+
+``ops->submit_read()`` and ``read_ctx`` are optional. ``read_ctx`` is used to
+pass in any custom data the caller needs accessible in the ops callbacks for
+fulfilling reads.
+
Buffered Writes
---------------
diff --git a/block/fops.c b/block/fops.c
index ddbc69c0922b..a2c2391d8dfa 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -533,12 +533,13 @@ const struct address_space_operations def_blk_aops = {
#else /* CONFIG_BUFFER_HEAD */
static int blkdev_read_folio(struct file *file, struct folio *folio)
{
- return iomap_read_folio(folio, &blkdev_iomap_ops);
+ iomap_bio_read_folio(folio, &blkdev_iomap_ops);
+ return 0;
}
static void blkdev_readahead(struct readahead_control *rac)
{
- iomap_readahead(rac, &blkdev_iomap_ops);
+ iomap_bio_readahead(rac, &blkdev_iomap_ops);
}
static ssize_t blkdev_writeback_range(struct iomap_writepage_ctx *wpc,
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 3b1ba571c728..be4191b33321 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -371,7 +371,8 @@ static int erofs_read_folio(struct file *file, struct folio *folio)
{
trace_erofs_read_folio(folio, true);
- return iomap_read_folio(folio, &erofs_iomap_ops);
+ iomap_bio_read_folio(folio, &erofs_iomap_ops);
+ return 0;
}
static void erofs_readahead(struct readahead_control *rac)
@@ -379,7 +380,7 @@ static void erofs_readahead(struct readahead_control *rac)
trace_erofs_readahead(rac->mapping->host, readahead_index(rac),
readahead_count(rac), true);
- return iomap_readahead(rac, &erofs_iomap_ops);
+ iomap_bio_readahead(rac, &erofs_iomap_ops);
}
static sector_t erofs_bmap(struct address_space *mapping, sector_t block)
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 47d74afd63ac..38d4f343187a 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -424,11 +424,11 @@ static int gfs2_read_folio(struct file *file, struct folio *folio)
struct inode *inode = folio->mapping->host;
struct gfs2_inode *ip = GFS2_I(inode);
struct gfs2_sbd *sdp = GFS2_SB(inode);
- int error;
+ int error = 0;
if (!gfs2_is_jdata(ip) ||
(i_blocksize(inode) == PAGE_SIZE && !folio_buffers(folio))) {
- error = iomap_read_folio(folio, &gfs2_iomap_ops);
+ iomap_bio_read_folio(folio, &gfs2_iomap_ops);
} else if (gfs2_is_stuffed(ip)) {
error = stuffed_read_folio(ip, folio);
} else {
@@ -503,7 +503,7 @@ static void gfs2_readahead(struct readahead_control *rac)
else if (gfs2_is_jdata(ip))
mpage_readahead(rac, gfs2_block_map);
else
- iomap_readahead(rac, &gfs2_iomap_ops);
+ iomap_bio_readahead(rac, &gfs2_iomap_ops);
}
/**
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 82bdf7c5e03c..9e1f1f0f8bf1 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -328,8 +328,8 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
}
#ifdef CONFIG_BLOCK
-static void iomap_finish_folio_read(struct folio *folio, size_t off,
- size_t len, int error)
+void iomap_finish_folio_read(struct folio *folio, size_t off, size_t len,
+ int error)
{
struct iomap_folio_state *ifs = folio->private;
bool uptodate = !error;
@@ -349,6 +349,7 @@ static void iomap_finish_folio_read(struct folio *folio, size_t off,
if (finished)
folio_end_read(folio, uptodate);
}
+EXPORT_SYMBOL_GPL(iomap_finish_folio_read);
static void iomap_read_end_io(struct bio *bio)
{
@@ -360,12 +361,6 @@ static void iomap_read_end_io(struct bio *bio)
bio_put(bio);
}
-struct iomap_read_folio_ctx {
- struct folio *cur_folio;
- void *read_ctx;
- struct readahead_control *rac;
-};
-
static void iomap_bio_submit_read(struct iomap_read_folio_ctx *ctx)
{
struct bio *bio = ctx->read_ctx;
@@ -374,7 +369,7 @@ static void iomap_bio_submit_read(struct iomap_read_folio_ctx *ctx)
submit_bio(bio);
}
-static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
+static int iomap_bio_read_folio_range(const struct iomap_iter *iter,
struct iomap_read_folio_ctx *ctx, size_t plen)
{
struct folio *folio = ctx->cur_folio;
@@ -412,8 +407,15 @@ static void iomap_bio_read_folio_range(const struct iomap_iter *iter,
bio_add_folio_nofail(bio, folio, plen, poff);
ctx->read_ctx = bio;
}
+ return 0;
}
+const struct iomap_read_ops iomap_bio_read_ops = {
+ .read_folio_range = iomap_bio_read_folio_range,
+ .submit_read = iomap_bio_submit_read,
+};
+EXPORT_SYMBOL_GPL(iomap_bio_read_ops);
+
static void iomap_read_init(struct folio *folio)
{
struct iomap_folio_state *ifs = folio->private;
@@ -508,7 +510,9 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
if (!*bytes_pending)
iomap_read_init(folio);
*bytes_pending += plen;
- iomap_bio_read_folio_range(iter, ctx, plen);
+ ret = ctx->ops->read_folio_range(iter, ctx, plen);
+ if (ret)
+ return ret;
}
ret = iomap_iter_advance(iter, plen);
@@ -520,26 +524,25 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
return 0;
}
-int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
+int iomap_read_folio(const struct iomap_ops *ops,
+ struct iomap_read_folio_ctx *ctx)
{
+ struct folio *folio = ctx->cur_folio;
struct iomap_iter iter = {
.inode = folio->mapping->host,
.pos = folio_pos(folio),
.len = folio_size(folio),
};
- struct iomap_read_folio_ctx ctx = {
- .cur_folio = folio,
- };
size_t bytes_pending = 0;
int ret;
trace_iomap_readpage(iter.inode, 1);
while ((ret = iomap_iter(&iter, ops)) > 0)
- iter.status = iomap_read_folio_iter(&iter, &ctx,
- &bytes_pending);
+ iter.status = iomap_read_folio_iter(&iter, ctx, &bytes_pending);
- iomap_bio_submit_read(&ctx);
+ if (ctx->ops->submit_read)
+ ctx->ops->submit_read(ctx);
iomap_read_end(folio, bytes_pending);
@@ -579,8 +582,8 @@ static int iomap_readahead_iter(struct iomap_iter *iter,
/**
* iomap_readahead - Attempt to read pages from a file.
- * @rac: Describes the pages to be read.
* @ops: The operations vector for the filesystem.
+ * @ctx: The ctx used for issuing readahead.
*
* This function is for filesystems to call to implement their readahead
* address_space operation.
@@ -592,28 +595,28 @@ static int iomap_readahead_iter(struct iomap_iter *iter,
* function is called with memalloc_nofs set, so allocations will not cause
* the filesystem to be reentered.
*/
-void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
+void iomap_readahead(const struct iomap_ops *ops,
+ struct iomap_read_folio_ctx *ctx)
{
+ struct readahead_control *rac = ctx->rac;
struct iomap_iter iter = {
.inode = rac->mapping->host,
.pos = readahead_pos(rac),
.len = readahead_length(rac),
};
- struct iomap_read_folio_ctx ctx = {
- .rac = rac,
- };
size_t cur_bytes_pending;
trace_iomap_readahead(rac->mapping->host, readahead_count(rac));
while (iomap_iter(&iter, ops) > 0)
- iter.status = iomap_readahead_iter(&iter, &ctx,
+ iter.status = iomap_readahead_iter(&iter, ctx,
&cur_bytes_pending);
- iomap_bio_submit_read(&ctx);
+ if (ctx->ops->submit_read)
+ ctx->ops->submit_read(ctx);
- if (ctx.cur_folio)
- iomap_read_end(ctx.cur_folio, cur_bytes_pending);
+ if (ctx->cur_folio)
+ iomap_read_end(ctx->cur_folio, cur_bytes_pending);
}
EXPORT_SYMBOL_GPL(iomap_readahead);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index a26f79815533..0c2ed00733f2 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -742,14 +742,15 @@ xfs_vm_read_folio(
struct file *unused,
struct folio *folio)
{
- return iomap_read_folio(folio, &xfs_read_iomap_ops);
+ iomap_bio_read_folio(folio, &xfs_read_iomap_ops);
+ return 0;
}
STATIC void
xfs_vm_readahead(
struct readahead_control *rac)
{
- iomap_readahead(rac, &xfs_read_iomap_ops);
+ iomap_bio_readahead(rac, &xfs_read_iomap_ops);
}
static int
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index fd3a5922f6c3..4d6e7eb52966 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -112,12 +112,13 @@ static const struct iomap_ops zonefs_write_iomap_ops = {
static int zonefs_read_folio(struct file *unused, struct folio *folio)
{
- return iomap_read_folio(folio, &zonefs_read_iomap_ops);
+ iomap_bio_read_folio(folio, &zonefs_read_iomap_ops);
+ return 0;
}
static void zonefs_readahead(struct readahead_control *rac)
{
- iomap_readahead(rac, &zonefs_read_iomap_ops);
+ iomap_bio_readahead(rac, &zonefs_read_iomap_ops);
}
/*
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4469b2318b08..37435b912755 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -16,6 +16,7 @@ struct inode;
struct iomap_iter;
struct iomap_dio;
struct iomap_writepage_ctx;
+struct iomap_read_folio_ctx;
struct iov_iter;
struct kiocb;
struct page;
@@ -337,8 +338,10 @@ static inline bool iomap_want_unshare_iter(const struct iomap_iter *iter)
ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
const struct iomap_ops *ops,
const struct iomap_write_ops *write_ops, void *private);
-int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
-void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
+int iomap_read_folio(const struct iomap_ops *ops,
+ struct iomap_read_folio_ctx *ctx);
+void iomap_readahead(const struct iomap_ops *ops,
+ struct iomap_read_folio_ctx *ctx);
bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
@@ -465,6 +468,8 @@ ssize_t iomap_add_to_ioend(struct iomap_writepage_ctx *wpc, struct folio *folio,
loff_t pos, loff_t end_pos, unsigned int dirty_len);
int iomap_ioend_writeback_submit(struct iomap_writepage_ctx *wpc, int error);
+void iomap_finish_folio_read(struct folio *folio, size_t off, size_t len,
+ int error);
void iomap_start_folio_write(struct inode *inode, struct folio *folio,
size_t len);
void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
@@ -473,6 +478,34 @@ void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
int iomap_writeback_folio(struct iomap_writepage_ctx *wpc, struct folio *folio);
int iomap_writepages(struct iomap_writepage_ctx *wpc);
+struct iomap_read_folio_ctx {
+ const struct iomap_read_ops *ops;
+ struct folio *cur_folio;
+ struct readahead_control *rac;
+ void *read_ctx;
+};
+
+struct iomap_read_ops {
+ /*
+ * Read in a folio range.
+ *
+ * The caller is responsible for calling iomap_finish_folio_read() after
+ * reading in the folio range. This should be done even if an error is
+ * encountered during the read.
+ *
+ * Returns 0 on success or a negative error on failure.
+ */
+ int (*read_folio_range)(const struct iomap_iter *iter,
+ struct iomap_read_folio_ctx *ctx, size_t len);
+
+ /*
+ * Submit any pending read requests.
+ *
+ * This is optional.
+ */
+ void (*submit_read)(struct iomap_read_folio_ctx *ctx);
+};
+
/*
* Flags for direct I/O ->end_io:
*/
@@ -538,4 +571,30 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
extern struct bio_set iomap_ioend_bioset;
+#ifdef CONFIG_BLOCK
+extern const struct iomap_read_ops iomap_bio_read_ops;
+
+static inline void iomap_bio_read_folio(struct folio *folio,
+ const struct iomap_ops *ops)
+{
+ struct iomap_read_folio_ctx ctx = {
+ .ops = &iomap_bio_read_ops,
+ .cur_folio = folio,
+ };
+
+ iomap_read_folio(ops, &ctx);
+}
+
+static inline void iomap_bio_readahead(struct readahead_control *rac,
+ const struct iomap_ops *ops)
+{
+ struct iomap_read_folio_ctx ctx = {
+ .ops = &iomap_bio_read_ops,
+ .rac = rac,
+ };
+
+ iomap_readahead(ops, &ctx);
+}
+#endif /* CONFIG_BLOCK */
+
#endif /* LINUX_IOMAP_H */
--
2.47.3
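To give a sense of how a filesystem that is not block-based would
consume this API, here is a hedged sketch ("myfs", myfs_send_read(),
and myfs_iomap_ops are hypothetical; per the rules above, the async
request's completion must call iomap_finish_folio_read() for the
range, even on error):

	static int myfs_read_folio_range(const struct iomap_iter *iter,
			struct iomap_read_folio_ctx *ctx, size_t len)
	{
		size_t poff = offset_in_folio(ctx->cur_folio, iter->pos);

		/*
		 * Hypothetical async send; its completion calls
		 * iomap_finish_folio_read(ctx->cur_folio, poff, len, err).
		 */
		return myfs_send_read(ctx->read_ctx, ctx->cur_folio, poff,
				      iter->pos, len);
	}

	static const struct iomap_read_ops myfs_read_ops = {
		.read_folio_range	= myfs_read_folio_range,
	};

	static int myfs_read_folio(struct file *file, struct folio *folio)
	{
		struct iomap_read_folio_ctx ctx = {
			.ops		= &myfs_read_ops,
			.cur_folio	= folio,
		};

		iomap_read_folio(&myfs_iomap_ops, &ctx);
		return 0;
	}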
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v5 10/14] iomap: move buffered io bio logic into new file
2025-09-26 0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
` (8 preceding siblings ...)
2025-09-26 0:26 ` [PATCH v5 09/14] iomap: add caller-provided callbacks for read and readahead Joanne Koong
@ 2025-09-26 0:26 ` Joanne Koong
2025-09-26 0:26 ` [PATCH v5 11/14] iomap: make iomap_read_folio() a void return Joanne Koong
` (4 subsequent siblings)
14 siblings, 0 replies; 28+ messages in thread
From: Joanne Koong @ 2025-09-26 0:26 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
From: Christoph Hellwig <hch@lst.de> [1]
Move the bio logic in the buffered I/O code into its own file and
remove the CONFIG_BLOCK gating for iomap read/readahead.
[1] https://lore.kernel.org/linux-fsdevel/aMK2GuumUf93ep99@infradead.org/
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/iomap/Makefile | 3 +-
fs/iomap/bio.c | 88 ++++++++++++++++++++++++++++++++++++++++++
fs/iomap/buffered-io.c | 88 +-----------------------------------------
fs/iomap/internal.h | 12 ++++++
4 files changed, 103 insertions(+), 88 deletions(-)
create mode 100644 fs/iomap/bio.c
diff --git a/fs/iomap/Makefile b/fs/iomap/Makefile
index f7e1c8534c46..a572b8808524 100644
--- a/fs/iomap/Makefile
+++ b/fs/iomap/Makefile
@@ -14,5 +14,6 @@ iomap-y += trace.o \
iomap-$(CONFIG_BLOCK) += direct-io.o \
ioend.o \
fiemap.o \
- seek.o
+ seek.o \
+ bio.o
iomap-$(CONFIG_SWAP) += swapfile.o
diff --git a/fs/iomap/bio.c b/fs/iomap/bio.c
new file mode 100644
index 000000000000..fc045f2e4c45
--- /dev/null
+++ b/fs/iomap/bio.c
@@ -0,0 +1,88 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2010 Red Hat, Inc.
+ * Copyright (C) 2016-2023 Christoph Hellwig.
+ */
+#include <linux/iomap.h>
+#include <linux/pagemap.h>
+#include "internal.h"
+#include "trace.h"
+
+static void iomap_read_end_io(struct bio *bio)
+{
+ int error = blk_status_to_errno(bio->bi_status);
+ struct folio_iter fi;
+
+ bio_for_each_folio_all(fi, bio)
+ iomap_finish_folio_read(fi.folio, fi.offset, fi.length, error);
+ bio_put(bio);
+}
+
+static void iomap_bio_submit_read(struct iomap_read_folio_ctx *ctx)
+{
+ struct bio *bio = ctx->read_ctx;
+
+ if (bio)
+ submit_bio(bio);
+}
+
+static int iomap_bio_read_folio_range(const struct iomap_iter *iter,
+ struct iomap_read_folio_ctx *ctx, size_t plen)
+{
+ struct folio *folio = ctx->cur_folio;
+ const struct iomap *iomap = &iter->iomap;
+ loff_t pos = iter->pos;
+ size_t poff = offset_in_folio(folio, pos);
+ loff_t length = iomap_length(iter);
+ sector_t sector;
+ struct bio *bio = ctx->read_ctx;
+
+ sector = iomap_sector(iomap, pos);
+ if (!bio || bio_end_sector(bio) != sector ||
+ !bio_add_folio(bio, folio, plen, poff)) {
+ gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
+ gfp_t orig_gfp = gfp;
+ unsigned int nr_vecs = DIV_ROUND_UP(length, PAGE_SIZE);
+
+ if (bio)
+ submit_bio(bio);
+
+ if (ctx->rac) /* same as readahead_gfp_mask */
+ gfp |= __GFP_NORETRY | __GFP_NOWARN;
+ bio = bio_alloc(iomap->bdev, bio_max_segs(nr_vecs), REQ_OP_READ,
+ gfp);
+ /*
+ * If the bio_alloc fails, try it again for a single page to
+ * avoid having to deal with partial page reads. This emulates
+ * what do_mpage_read_folio does.
+ */
+ if (!bio)
+ bio = bio_alloc(iomap->bdev, 1, REQ_OP_READ, orig_gfp);
+ if (ctx->rac)
+ bio->bi_opf |= REQ_RAHEAD;
+ bio->bi_iter.bi_sector = sector;
+ bio->bi_end_io = iomap_read_end_io;
+ bio_add_folio_nofail(bio, folio, plen, poff);
+ ctx->read_ctx = bio;
+ }
+ return 0;
+}
+
+const struct iomap_read_ops iomap_bio_read_ops = {
+ .read_folio_range = iomap_bio_read_folio_range,
+ .submit_read = iomap_bio_submit_read,
+};
+EXPORT_SYMBOL_GPL(iomap_bio_read_ops);
+
+int iomap_bio_read_folio_range_sync(const struct iomap_iter *iter,
+ struct folio *folio, loff_t pos, size_t len)
+{
+ const struct iomap *srcmap = iomap_iter_srcmap(iter);
+ struct bio_vec bvec;
+ struct bio bio;
+
+ bio_init(&bio, srcmap->bdev, &bvec, 1, REQ_OP_READ);
+ bio.bi_iter.bi_sector = iomap_sector(srcmap, pos);
+ bio_add_folio_nofail(&bio, folio, len, offset_in_folio(folio, pos));
+ return submit_bio_wait(&bio);
+}
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9e1f1f0f8bf1..86c8094e5cc8 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -8,6 +8,7 @@
#include <linux/writeback.h>
#include <linux/swap.h>
#include <linux/migrate.h>
+#include "internal.h"
#include "trace.h"
#include "../internal.h"
@@ -327,7 +328,6 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
return 0;
}
-#ifdef CONFIG_BLOCK
void iomap_finish_folio_read(struct folio *folio, size_t off, size_t len,
int error)
{
@@ -351,71 +351,6 @@ void iomap_finish_folio_read(struct folio *folio, size_t off, size_t len,
}
EXPORT_SYMBOL_GPL(iomap_finish_folio_read);
-static void iomap_read_end_io(struct bio *bio)
-{
- int error = blk_status_to_errno(bio->bi_status);
- struct folio_iter fi;
-
- bio_for_each_folio_all(fi, bio)
- iomap_finish_folio_read(fi.folio, fi.offset, fi.length, error);
- bio_put(bio);
-}
-
-static void iomap_bio_submit_read(struct iomap_read_folio_ctx *ctx)
-{
- struct bio *bio = ctx->read_ctx;
-
- if (bio)
- submit_bio(bio);
-}
-
-static int iomap_bio_read_folio_range(const struct iomap_iter *iter,
- struct iomap_read_folio_ctx *ctx, size_t plen)
-{
- struct folio *folio = ctx->cur_folio;
- const struct iomap *iomap = &iter->iomap;
- loff_t pos = iter->pos;
- size_t poff = offset_in_folio(folio, pos);
- loff_t length = iomap_length(iter);
- sector_t sector;
- struct bio *bio = ctx->read_ctx;
-
- sector = iomap_sector(iomap, pos);
- if (!bio || bio_end_sector(bio) != sector ||
- !bio_add_folio(bio, folio, plen, poff)) {
- gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
- gfp_t orig_gfp = gfp;
- unsigned int nr_vecs = DIV_ROUND_UP(length, PAGE_SIZE);
-
- iomap_bio_submit_read(ctx);
-
- if (ctx->rac) /* same as readahead_gfp_mask */
- gfp |= __GFP_NORETRY | __GFP_NOWARN;
- bio = bio_alloc(iomap->bdev, bio_max_segs(nr_vecs), REQ_OP_READ,
- gfp);
- /*
- * If the bio_alloc fails, try it again for a single page to
- * avoid having to deal with partial page reads. This emulates
- * what do_mpage_read_folio does.
- */
- if (!bio)
- bio = bio_alloc(iomap->bdev, 1, REQ_OP_READ, orig_gfp);
- if (ctx->rac)
- bio->bi_opf |= REQ_RAHEAD;
- bio->bi_iter.bi_sector = sector;
- bio->bi_end_io = iomap_read_end_io;
- bio_add_folio_nofail(bio, folio, plen, poff);
- ctx->read_ctx = bio;
- }
- return 0;
-}
-
-const struct iomap_read_ops iomap_bio_read_ops = {
- .read_folio_range = iomap_bio_read_folio_range,
- .submit_read = iomap_bio_submit_read,
-};
-EXPORT_SYMBOL_GPL(iomap_bio_read_ops);
-
static void iomap_read_init(struct folio *folio)
{
struct iomap_folio_state *ifs = folio->private;
@@ -620,27 +555,6 @@ void iomap_readahead(const struct iomap_ops *ops,
}
EXPORT_SYMBOL_GPL(iomap_readahead);
-static int iomap_bio_read_folio_range_sync(const struct iomap_iter *iter,
- struct folio *folio, loff_t pos, size_t len)
-{
- const struct iomap *srcmap = iomap_iter_srcmap(iter);
- struct bio_vec bvec;
- struct bio bio;
-
- bio_init(&bio, srcmap->bdev, &bvec, 1, REQ_OP_READ);
- bio.bi_iter.bi_sector = iomap_sector(srcmap, pos);
- bio_add_folio_nofail(&bio, folio, len, offset_in_folio(folio, pos));
- return submit_bio_wait(&bio);
-}
-#else
-static int iomap_bio_read_folio_range_sync(const struct iomap_iter *iter,
- struct folio *folio, loff_t pos, size_t len)
-{
- WARN_ON_ONCE(1);
- return -EIO;
-}
-#endif /* CONFIG_BLOCK */
-
/*
* iomap_is_partially_uptodate checks whether blocks within a folio are
* uptodate or not.
diff --git a/fs/iomap/internal.h b/fs/iomap/internal.h
index d05cb3aed96e..3a4e4aad2bd1 100644
--- a/fs/iomap/internal.h
+++ b/fs/iomap/internal.h
@@ -6,4 +6,16 @@
u32 iomap_finish_ioend_direct(struct iomap_ioend *ioend);
+#ifdef CONFIG_BLOCK
+int iomap_bio_read_folio_range_sync(const struct iomap_iter *iter,
+ struct folio *folio, loff_t pos, size_t len);
+#else
+static inline int iomap_bio_read_folio_range_sync(const struct iomap_iter *iter,
+ struct folio *folio, loff_t pos, size_t len)
+{
+ WARN_ON_ONCE(1);
+ return -EIO;
+}
+#endif /* CONFIG_BLOCK */
+
#endif /* _IOMAP_INTERNAL_H */
--
2.47.3
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v5 11/14] iomap: make iomap_read_folio() a void return
2025-09-26 0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
` (9 preceding siblings ...)
2025-09-26 0:26 ` [PATCH v5 10/14] iomap: move buffered io bio logic into new file Joanne Koong
@ 2025-09-26 0:26 ` Joanne Koong
2025-09-26 0:26 ` [PATCH v5 12/14] fuse: use iomap for read_folio Joanne Koong
` (3 subsequent siblings)
14 siblings, 0 replies; 28+ messages in thread
From: Joanne Koong @ 2025-09-26 0:26 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
iomap_read_folio() never propagates errors to its callers. Change it
to a void return to make this explicit.
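To illustrate the caller side (a minimal sketch; the foo_* names are
hypothetical and not part of this series), a filesystem's ->read_folio
now has no error to forward, since per-range errors are surfaced
through folio state instead:

static int foo_read_folio(struct file *file, struct folio *folio)
{
        struct iomap_read_folio_ctx ctx = {
                .cur_folio = folio,
                .ops = &foo_iomap_read_ops,
        };

        iomap_read_folio(&foo_iomap_ops, &ctx);
        /* any per-range error was already reported via iomap_finish_folio_read() */
        return 0;
}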
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 9 +--------
include/linux/iomap.h | 2 +-
2 files changed, 2 insertions(+), 9 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 86c8094e5cc8..f9ae72713f74 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -459,7 +459,7 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
return 0;
}
-int iomap_read_folio(const struct iomap_ops *ops,
+void iomap_read_folio(const struct iomap_ops *ops,
struct iomap_read_folio_ctx *ctx)
{
struct folio *folio = ctx->cur_folio;
@@ -480,13 +480,6 @@ int iomap_read_folio(const struct iomap_ops *ops,
ctx->ops->submit_read(ctx);
iomap_read_end(folio, bytes_pending);
-
- /*
- * Just like mpage_readahead and block_read_full_folio, we always
- * return 0 and just set the folio error flag on errors. This
- * should be cleaned up throughout the stack eventually.
- */
- return 0;
}
EXPORT_SYMBOL_GPL(iomap_read_folio);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 37435b912755..6d864b446b6e 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -338,7 +338,7 @@ static inline bool iomap_want_unshare_iter(const struct iomap_iter *iter)
ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
const struct iomap_ops *ops,
const struct iomap_write_ops *write_ops, void *private);
-int iomap_read_folio(const struct iomap_ops *ops,
+void iomap_read_folio(const struct iomap_ops *ops,
struct iomap_read_folio_ctx *ctx);
void iomap_readahead(const struct iomap_ops *ops,
struct iomap_read_folio_ctx *ctx);
--
2.47.3
* [PATCH v5 12/14] fuse: use iomap for read_folio
2025-09-26 0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
` (10 preceding siblings ...)
2025-09-26 0:26 ` [PATCH v5 11/14] iomap: make iomap_read_folio() a void return Joanne Koong
@ 2025-09-26 0:26 ` Joanne Koong
2025-09-26 0:26 ` [PATCH v5 13/14] fuse: use iomap for readahead Joanne Koong
` (2 subsequent siblings)
14 siblings, 0 replies; 28+ messages in thread
From: Joanne Koong @ 2025-09-26 0:26 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
Read folio data into the page cache using iomap. This gives us granular
uptodate tracking for large folios, which minimizes how much data needs
to be read in: if some portions of the folio are already uptodate (e.g.
from a prior write), only the non-uptodate portions need to be read.
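As a minimal sketch of the resulting read path (foo_fill_range is a
hypothetical transport helper; the signature follows the
iomap_read_ops interface added earlier in this series), a
->read_folio_range implementation is handed only the non-uptodate byte
range:

static int foo_read_folio_range(const struct iomap_iter *iter,
                                struct iomap_read_folio_ctx *ctx, size_t len)
{
        struct folio *folio = ctx->cur_folio;
        size_t off = offset_in_folio(folio, iter->pos);
        int err;

        /* iomap has already trimmed uptodate blocks from this range */
        err = foo_fill_range(ctx->read_ctx, folio, off, len);
        iomap_finish_folio_read(folio, off, len, err);
        return err;
}

For example, with 4k blocks in a 16k folio whose first block is
uptodate from a prior write, only the remaining 12k is read rather
than the whole folio.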
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/file.c | 80 +++++++++++++++++++++++++++++++++++---------------
1 file changed, 56 insertions(+), 24 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 4adcf09d4b01..db93c83ee4a3 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -828,23 +828,69 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio,
return 0;
}
+static int fuse_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
+ unsigned int flags, struct iomap *iomap,
+ struct iomap *srcmap)
+{
+ iomap->type = IOMAP_MAPPED;
+ iomap->length = length;
+ iomap->offset = offset;
+ return 0;
+}
+
+static const struct iomap_ops fuse_iomap_ops = {
+ .iomap_begin = fuse_iomap_begin,
+};
+
+struct fuse_fill_read_data {
+ struct file *file;
+};
+
+static int fuse_iomap_read_folio_range_async(const struct iomap_iter *iter,
+ struct iomap_read_folio_ctx *ctx,
+ size_t len)
+{
+ struct fuse_fill_read_data *data = ctx->read_ctx;
+ struct folio *folio = ctx->cur_folio;
+ loff_t pos = iter->pos;
+ size_t off = offset_in_folio(folio, pos);
+ struct file *file = data->file;
+ int ret;
+
+ /*
+ * for non-readahead read requests, do reads synchronously since
+ * it's not guaranteed that the server can handle out-of-order reads
+ */
+ ret = fuse_do_readfolio(file, folio, off, len);
+ iomap_finish_folio_read(folio, off, len, ret);
+ return ret;
+}
+
+static const struct iomap_read_ops fuse_iomap_read_ops = {
+ .read_folio_range = fuse_iomap_read_folio_range_async,
+};
+
static int fuse_read_folio(struct file *file, struct folio *folio)
{
struct inode *inode = folio->mapping->host;
- int err;
+ struct fuse_fill_read_data data = {
+ .file = file,
+ };
+ struct iomap_read_folio_ctx ctx = {
+ .cur_folio = folio,
+ .ops = &fuse_iomap_read_ops,
+ .read_ctx = &data,
- err = -EIO;
- if (fuse_is_bad(inode))
- goto out;
+ };
- err = fuse_do_readfolio(file, folio, 0, folio_size(folio));
- if (!err)
- folio_mark_uptodate(folio);
+ if (fuse_is_bad(inode)) {
+ folio_unlock(folio);
+ return -EIO;
+ }
+ iomap_read_folio(&fuse_iomap_ops, &ctx);
fuse_invalidate_atime(inode);
- out:
- folio_unlock(folio);
- return err;
+ return 0;
}
static int fuse_iomap_read_folio_range(const struct iomap_iter *iter,
@@ -1394,20 +1440,6 @@ static const struct iomap_write_ops fuse_iomap_write_ops = {
.read_folio_range = fuse_iomap_read_folio_range,
};
-static int fuse_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
- unsigned int flags, struct iomap *iomap,
- struct iomap *srcmap)
-{
- iomap->type = IOMAP_MAPPED;
- iomap->length = length;
- iomap->offset = offset;
- return 0;
-}
-
-static const struct iomap_ops fuse_iomap_ops = {
- .iomap_begin = fuse_iomap_begin,
-};
-
static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct file *file = iocb->ki_filp;
--
2.47.3
* [PATCH v5 13/14] fuse: use iomap for readahead
2025-09-26 0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
` (11 preceding siblings ...)
2025-09-26 0:26 ` [PATCH v5 12/14] fuse: use iomap for read_folio Joanne Koong
@ 2025-09-26 0:26 ` Joanne Koong
2025-09-26 0:26 ` [PATCH v5 14/14] fuse: remove fc->blkbits workaround for partial writes Joanne Koong
2025-09-29 9:38 ` [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Christian Brauner
14 siblings, 0 replies; 28+ messages in thread
From: Joanne Koong @ 2025-09-26 0:26 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
Do readahead in fuse using iomap. This gives us granular uptodate
tracking for large folios, which minimizes how much data needs to be
read in: if some portions of the folio are already uptodate (e.g. from
a prior write), only the non-uptodate portions need to be read.
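For instance (illustrative numbers, not taken from this patch): with
4k pages and fc->max_pages = 32, fuse_folios_need_send() caps a single
request at 128k, so a contiguous 256k readahead window is sent as two
FUSE read requests, provided fc->max_read is at least as large.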
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/file.c | 220 ++++++++++++++++++++++++++++---------------------
1 file changed, 124 insertions(+), 96 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index db93c83ee4a3..7c9c00784e33 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -844,8 +844,65 @@ static const struct iomap_ops fuse_iomap_ops = {
struct fuse_fill_read_data {
struct file *file;
+
+ /* Fields below are used if sending the read request asynchronously */
+ struct fuse_conn *fc;
+ struct fuse_io_args *ia;
+ unsigned int nr_bytes;
};
+/* forward declarations */
+static bool fuse_folios_need_send(struct fuse_conn *fc, loff_t pos,
+ unsigned len, struct fuse_args_pages *ap,
+ unsigned cur_bytes, bool write);
+static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file,
+ unsigned int count, bool async);
+
+static int fuse_handle_readahead(struct folio *folio,
+ struct readahead_control *rac,
+ struct fuse_fill_read_data *data, loff_t pos,
+ size_t len)
+{
+ struct fuse_io_args *ia = data->ia;
+ size_t off = offset_in_folio(folio, pos);
+ struct fuse_conn *fc = data->fc;
+ struct fuse_args_pages *ap;
+ unsigned int nr_pages;
+
+ if (ia && fuse_folios_need_send(fc, pos, len, &ia->ap, data->nr_bytes,
+ false)) {
+ fuse_send_readpages(ia, data->file, data->nr_bytes,
+ fc->async_read);
+ data->nr_bytes = 0;
+ data->ia = NULL;
+ ia = NULL;
+ }
+ if (!ia) {
+ if (fc->num_background >= fc->congestion_threshold &&
+ rac->ra->async_size >= readahead_count(rac))
+ /*
+ * Congested and only async pages left, so skip the
+ * rest.
+ */
+ return -EAGAIN;
+
+ nr_pages = min(fc->max_pages, readahead_count(rac));
+ data->ia = fuse_io_alloc(NULL, nr_pages);
+ if (!data->ia)
+ return -ENOMEM;
+ ia = data->ia;
+ }
+ folio_get(folio);
+ ap = &ia->ap;
+ ap->folios[ap->num_folios] = folio;
+ ap->descs[ap->num_folios].offset = off;
+ ap->descs[ap->num_folios].length = len;
+ data->nr_bytes += len;
+ ap->num_folios++;
+
+ return 0;
+}
+
static int fuse_iomap_read_folio_range_async(const struct iomap_iter *iter,
struct iomap_read_folio_ctx *ctx,
size_t len)
@@ -857,17 +914,39 @@ static int fuse_iomap_read_folio_range_async(const struct iomap_iter *iter,
struct file *file = data->file;
int ret;
- /*
- * for non-readahead read requests, do reads synchronously since
- * it's not guaranteed that the server can handle out-of-order reads
- */
- ret = fuse_do_readfolio(file, folio, off, len);
- iomap_finish_folio_read(folio, off, len, ret);
+ if (ctx->rac) {
+ ret = fuse_handle_readahead(folio, ctx->rac, data, pos, len);
+ /*
+ * If fuse_handle_readahead was successful, fuse_readpages_end
+ * will do the iomap_finish_folio_read, else we need to call it
+ * here
+ */
+ if (ret)
+ iomap_finish_folio_read(folio, off, len, ret);
+ } else {
+ /*
+ * for non-readahead read requests, do reads synchronously
+ * since it's not guaranteed that the server can handle
+ * out-of-order reads
+ */
+ ret = fuse_do_readfolio(file, folio, off, len);
+ iomap_finish_folio_read(folio, off, len, ret);
+ }
return ret;
}
+static void fuse_iomap_read_submit(struct iomap_read_folio_ctx *ctx)
+{
+ struct fuse_fill_read_data *data = ctx->read_ctx;
+
+ if (data->ia)
+ fuse_send_readpages(data->ia, data->file, data->nr_bytes,
+ data->fc->async_read);
+}
+
static const struct iomap_read_ops fuse_iomap_read_ops = {
.read_folio_range = fuse_iomap_read_folio_range_async,
+ .submit_read = fuse_iomap_read_submit,
};
static int fuse_read_folio(struct file *file, struct folio *folio)
@@ -929,7 +1008,8 @@ static void fuse_readpages_end(struct fuse_mount *fm, struct fuse_args *args,
}
for (i = 0; i < ap->num_folios; i++) {
- folio_end_read(ap->folios[i], !err);
+ iomap_finish_folio_read(ap->folios[i], ap->descs[i].offset,
+ ap->descs[i].length, err);
folio_put(ap->folios[i]);
}
if (ia->ff)
@@ -939,7 +1019,7 @@ static void fuse_readpages_end(struct fuse_mount *fm, struct fuse_args *args,
}
static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file,
- unsigned int count)
+ unsigned int count, bool async)
{
struct fuse_file *ff = file->private_data;
struct fuse_mount *fm = ff->fm;
@@ -961,7 +1041,7 @@ static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file,
fuse_read_args_fill(ia, file, pos, count, FUSE_READ);
ia->read.attr_ver = fuse_get_attr_version(fm->fc);
- if (fm->fc->async_read) {
+ if (async) {
ia->ff = fuse_file_get(ff);
ap->args.end = fuse_readpages_end;
err = fuse_simple_background(fm, &ap->args, GFP_KERNEL);
@@ -978,81 +1058,20 @@ static void fuse_readahead(struct readahead_control *rac)
{
struct inode *inode = rac->mapping->host;
struct fuse_conn *fc = get_fuse_conn(inode);
- unsigned int max_pages, nr_pages;
- struct folio *folio = NULL;
+ struct fuse_fill_read_data data = {
+ .file = rac->file,
+ .fc = fc,
+ };
+ struct iomap_read_folio_ctx ctx = {
+ .ops = &fuse_iomap_read_ops,
+ .rac = rac,
+ .read_ctx = &data
+ };
if (fuse_is_bad(inode))
return;
- max_pages = min_t(unsigned int, fc->max_pages,
- fc->max_read / PAGE_SIZE);
-
- /*
- * This is only accurate the first time through, since readahead_folio()
- * doesn't update readahead_count() from the previous folio until the
- * next call. Grab nr_pages here so we know how many pages we're going
- * to have to process. This means that we will exit here with
- * readahead_count() == folio_nr_pages(last_folio), but we will have
- * consumed all of the folios, and read_pages() will call
- * readahead_folio() again which will clean up the rac.
- */
- nr_pages = readahead_count(rac);
-
- while (nr_pages) {
- struct fuse_io_args *ia;
- struct fuse_args_pages *ap;
- unsigned cur_pages = min(max_pages, nr_pages);
- unsigned int pages = 0;
-
- if (fc->num_background >= fc->congestion_threshold &&
- rac->ra->async_size >= readahead_count(rac))
- /*
- * Congested and only async pages left, so skip the
- * rest.
- */
- break;
-
- ia = fuse_io_alloc(NULL, cur_pages);
- if (!ia)
- break;
- ap = &ia->ap;
-
- while (pages < cur_pages) {
- unsigned int folio_pages;
-
- /*
- * This returns a folio with a ref held on it.
- * The ref needs to be held until the request is
- * completed, since the splice case (see
- * fuse_try_move_page()) drops the ref after it's
- * replaced in the page cache.
- */
- if (!folio)
- folio = __readahead_folio(rac);
-
- folio_pages = folio_nr_pages(folio);
- if (folio_pages > cur_pages - pages) {
- /*
- * Large folios belonging to fuse will never
- * have more pages than max_pages.
- */
- WARN_ON(!pages);
- break;
- }
-
- ap->folios[ap->num_folios] = folio;
- ap->descs[ap->num_folios].length = folio_size(folio);
- ap->num_folios++;
- pages += folio_pages;
- folio = NULL;
- }
- fuse_send_readpages(ia, rac->file, pages << PAGE_SHIFT);
- nr_pages -= pages;
- }
- if (folio) {
- folio_end_read(folio, false);
- folio_put(folio);
- }
+ iomap_readahead(&fuse_iomap_ops, &ctx);
}
static ssize_t fuse_cache_read_iter(struct kiocb *iocb, struct iov_iter *to)
@@ -2083,7 +2102,7 @@ struct fuse_fill_wb_data {
struct fuse_file *ff;
unsigned int max_folios;
/*
- * nr_bytes won't overflow since fuse_writepage_need_send() caps
+ * nr_bytes won't overflow since fuse_folios_need_send() caps
* wb requests to never exceed fc->max_pages (which has an upper bound
* of U16_MAX).
*/
@@ -2128,14 +2147,15 @@ static void fuse_writepages_send(struct inode *inode,
spin_unlock(&fi->lock);
}
-static bool fuse_writepage_need_send(struct fuse_conn *fc, loff_t pos,
- unsigned len, struct fuse_args_pages *ap,
- struct fuse_fill_wb_data *data)
+static bool fuse_folios_need_send(struct fuse_conn *fc, loff_t pos,
+ unsigned len, struct fuse_args_pages *ap,
+ unsigned cur_bytes, bool write)
{
struct folio *prev_folio;
struct fuse_folio_desc prev_desc;
- unsigned bytes = data->nr_bytes + len;
+ unsigned bytes = cur_bytes + len;
loff_t prev_pos;
+ size_t max_bytes = write ? fc->max_write : fc->max_read;
WARN_ON(!ap->num_folios);
@@ -2143,8 +2163,7 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, loff_t pos,
if ((bytes + PAGE_SIZE - 1) >> PAGE_SHIFT > fc->max_pages)
return true;
- /* Reached max write bytes */
- if (bytes > fc->max_write)
+ if (bytes > max_bytes)
return true;
/* Discontinuity */
@@ -2154,11 +2173,6 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, loff_t pos,
if (prev_pos != pos)
return true;
- /* Need to grow the pages array? If so, did the expansion fail? */
- if (ap->num_folios == data->max_folios &&
- !fuse_pages_realloc(data, fc->max_pages))
- return true;
-
return false;
}
@@ -2182,10 +2196,24 @@ static ssize_t fuse_iomap_writeback_range(struct iomap_writepage_ctx *wpc,
return -EIO;
}
- if (wpa && fuse_writepage_need_send(fc, pos, len, ap, data)) {
- fuse_writepages_send(inode, data);
- data->wpa = NULL;
- data->nr_bytes = 0;
+ if (wpa) {
+ bool send = fuse_folios_need_send(fc, pos, len, ap,
+ data->nr_bytes, true);
+
+ if (!send) {
+ /*
+ * Need to grow the pages array? If so, did the
+ * expansion fail?
+ */
+ send = (ap->num_folios == data->max_folios) &&
+ !fuse_pages_realloc(data, fc->max_pages);
+ }
+
+ if (send) {
+ fuse_writepages_send(inode, data);
+ data->wpa = NULL;
+ data->nr_bytes = 0;
+ }
}
if (data->wpa == NULL) {
--
2.47.3
* [PATCH v5 14/14] fuse: remove fc->blkbits workaround for partial writes
2025-09-26 0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
` (12 preceding siblings ...)
2025-09-26 0:26 ` [PATCH v5 13/14] fuse: use iomap for readahead Joanne Koong
@ 2025-09-26 0:26 ` Joanne Koong
2025-09-29 9:38 ` [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Christian Brauner
14 siblings, 0 replies; 28+ messages in thread
From: Joanne Koong @ 2025-09-26 0:26 UTC (permalink / raw)
To: brauner, miklos
Cc: djwong, hch, hsiangkao, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc
Now that fuse is integrated with iomap for read/readahead, remove the
workaround added in commit bd24d2108e9c ("fuse: fix fuseblk i_blkbits
for iomap partial writes"). The workaround avoided a race in which an
iomap partial write could be overwritten by a read when blocksize <
PAGE_SIZE. With iomap read/readahead, uptodate state is tracked per
block, so that race can no longer occur and the workaround is no
longer needed.
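As an illustrative scenario (numbers assumed): with a 1k fuseblk
blocksize and 4k pages, a 1k partial write through iomap marks only
one block of the folio uptodate. Without granular tracking, a
subsequent read would fetch the full 4k folio from the server and
could overwrite the not-yet-written-back 1k; with iomap
read/readahead, only the three non-uptodate 1k blocks are read in.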
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/dir.c | 2 +-
fs/fuse/fuse_i.h | 8 --------
fs/fuse/inode.c | 13 +------------
3 files changed, 2 insertions(+), 21 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 5c569c3cb53f..ebee7e0b1cd3 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1199,7 +1199,7 @@ static void fuse_fillattr(struct mnt_idmap *idmap, struct inode *inode,
if (attr->blksize != 0)
blkbits = ilog2(attr->blksize);
else
- blkbits = fc->blkbits;
+ blkbits = inode->i_sb->s_blocksize_bits;
stat->blksize = 1 << blkbits;
}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index cc428d04be3e..1647eb7ca6fa 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -975,14 +975,6 @@ struct fuse_conn {
/* Request timeout (in jiffies). 0 = no timeout */
unsigned int req_timeout;
} timeout;
-
- /*
- * This is a workaround until fuse uses iomap for reads.
- * For fuseblk servers, this represents the blocksize passed in at
- * mount time and for regular fuse servers, this is equivalent to
- * inode->i_blkbits.
- */
- u8 blkbits;
};
/*
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 7485a41af892..a1b9e8587155 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -292,7 +292,7 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
if (attr->blksize)
fi->cached_i_blkbits = ilog2(attr->blksize);
else
- fi->cached_i_blkbits = fc->blkbits;
+ fi->cached_i_blkbits = inode->i_sb->s_blocksize_bits;
/*
* Don't set the sticky bit in i_mode, unless we want the VFS
@@ -1810,21 +1810,10 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
err = -EINVAL;
if (!sb_set_blocksize(sb, ctx->blksize))
goto err;
- /*
- * This is a workaround until fuse hooks into iomap for reads.
- * Use PAGE_SIZE for the blocksize else if the writeback cache
- * is enabled, buffered writes go through iomap and a read may
- * overwrite partially written data if blocksize < PAGE_SIZE
- */
- fc->blkbits = sb->s_blocksize_bits;
- if (ctx->blksize != PAGE_SIZE &&
- !sb_set_blocksize(sb, PAGE_SIZE))
- goto err;
#endif
} else {
sb->s_blocksize = PAGE_SIZE;
sb->s_blocksize_bits = PAGE_SHIFT;
- fc->blkbits = sb->s_blocksize_bits;
}
sb->s_subtype = ctx->subtype;
--
2.47.3
^ permalink raw reply related [flat|nested] 28+ messages in thread* Re: [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead
2025-09-26 0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
` (13 preceding siblings ...)
2025-09-26 0:26 ` [PATCH v5 14/14] fuse: remove fc->blkbits workaround for partial writes Joanne Koong
@ 2025-09-29 9:38 ` Christian Brauner
14 siblings, 0 replies; 28+ messages in thread
From: Christian Brauner @ 2025-09-29 9:38 UTC (permalink / raw)
To: miklos, Joanne Koong
Cc: Christian Brauner, djwong, hch, linux-block, gfs2, linux-fsdevel,
kernel-team, linux-xfs, linux-doc, Gao Xiang
On Thu, 25 Sep 2025 17:25:55 -0700, Joanne Koong wrote:
> This series adds fuse iomap support for buffered reads and readahead.
> This is needed so that fuse can use granular uptodate tracking when
> large folios are enabled, so that only the non-uptodate portions of a
> folio need to be read in instead of the entire folio. It is also
> needed to turn on large folios for servers that use the writeback
> cache, since otherwise a race condition may lead to data corruption:
> if a partial write is followed by a read before the write has
> undergone writeback, the folio will not be marked uptodate by the
> partial write, so the read will read in the entire folio from disk,
> overwriting the partial write.
>
> [...]
Applied to the vfs-6.19.iomap branch of the vfs/vfs.git tree.
Patches in the vfs-6.19.iomap branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.19.iomap
[01/14] iomap: move bio read logic into helper function
https://git.kernel.org/vfs/vfs/c/4b1f54633425
[02/14] iomap: move read/readahead bio submission logic into helper function
https://git.kernel.org/vfs/vfs/c/22159441469a
[03/14] iomap: store read/readahead bio generically
https://git.kernel.org/vfs/vfs/c/7c732b99c04f
[04/14] iomap: iterate over folio mapping in iomap_readpage_iter()
https://git.kernel.org/vfs/vfs/c/3b404627d3e2
[05/14] iomap: rename iomap_readpage_iter() to iomap_read_folio_iter()
https://git.kernel.org/vfs/vfs/c/bf8b9f4ce6a9
[06/14] iomap: rename iomap_readpage_ctx struct to iomap_read_folio_ctx
https://git.kernel.org/vfs/vfs/c/abea60c60330
[07/14] iomap: track pending read bytes more optimally
https://git.kernel.org/vfs/vfs/c/13cc90f6c38e
[08/14] iomap: set accurate iter->pos when reading folio ranges
https://git.kernel.org/vfs/vfs/c/63adb033604e
[09/14] iomap: add caller-provided callbacks for read and readahead
https://git.kernel.org/vfs/vfs/c/56b6f5d3792b
[10/14] iomap: move buffered io bio logic into new file
https://git.kernel.org/vfs/vfs/c/80cd9857c47f
[11/14] iomap: make iomap_read_folio() a void return
https://git.kernel.org/vfs/vfs/c/434651f1a9b7
[12/14] fuse: use iomap for read_folio
https://git.kernel.org/vfs/vfs/c/12cae30dc565
[13/14] fuse: use iomap for readahead
https://git.kernel.org/vfs/vfs/c/0853f58ed0b4
[14/14] fuse: remove fc->blkbits workaround for partial writes
https://git.kernel.org/vfs/vfs/c/bb944dc82db1