* [PATCH v3 0/1] iomap: fix readahead folio access after folio_end_read()
@ 2026-01-16 20:04 Joanne Koong
2026-01-16 20:04 ` [PATCH v3 1/1] " Joanne Koong
0 siblings, 1 reply; 9+ messages in thread
From: Joanne Koong @ 2026-01-16 20:04 UTC (permalink / raw)
To: brauner; +Cc: willy, djwong, hch, bfoster, linux-fsdevel
This patch is on top of Christian's vfs.fixes tree.
The change in the patch was sanity-tested on fuse and xfs (with large folios
disabled) by running generic/112 for both readahead and read_folio (simulated
by commenting out the ->readahead() callback).
Thanks,
Joanne
Changelog
---------
v2: https://lore.kernel.org/linux-fsdevel/20260116015452.757719-1-joannelkoong@gmail.com/
* Fix incorrect spacing (Matthew)
* Reword commit title and message (Matthew)
v1: https://lore.kernel.org/linux-fsdevel/20260114180255.3043081-1-joannelkoong@gmail.com/
* Invalidate ctx->cur_folio instead of retaining readahead caller refcount (Matthew)
Joanne Koong (1):
iomap: fix folio access after folio_end_read() in readahead
fs/iomap/buffered-io.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
--
2.47.3
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v3 1/1] iomap: fix readahead folio access after folio_end_read()
2026-01-16 20:04 [PATCH v3 0/1] iomap: fix readahead folio access after folio_end_read() Joanne Koong
@ 2026-01-16 20:04 ` Joanne Koong
2026-01-21 7:49 ` Christoph Hellwig
0 siblings, 1 reply; 9+ messages in thread
From: Joanne Koong @ 2026-01-16 20:04 UTC (permalink / raw)
To: brauner; +Cc: willy, djwong, hch, bfoster, linux-fsdevel
If the folio does not have an iomap_folio_state struct attached to it
and the folio gets read in by the filesystem's IO helper,
folio_end_read() may already have been called on the folio by the time
control returns to iomap.
Fix this by invalidating ctx->cur_folio when a folio without
iomap_folio_state metadata attached to it has been handed to the
filesystem's IO helper.
Fixes: b2f35ac4146d ("iomap: add caller-provided callbacks for read and readahead")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/iomap/buffered-io.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6beb876658c0..8b7fb33d7212 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -502,6 +502,8 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
loff_t pos = iter->pos;
loff_t length = iomap_length(iter);
struct folio *folio = ctx->cur_folio;
+ size_t folio_len = folio_size(folio);
+ struct iomap_folio_state *ifs;
size_t poff, plen;
loff_t pos_diff;
int ret;
@@ -513,10 +515,10 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
return iomap_iter_advance(iter, length);
}
- ifs_alloc(iter->inode, folio, iter->flags);
+ ifs = ifs_alloc(iter->inode, folio, iter->flags);
length = min_t(loff_t, length,
- folio_size(folio) - offset_in_folio(folio, pos));
+ folio_len - offset_in_folio(folio, pos));
while (length) {
iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff,
&plen);
@@ -542,7 +544,24 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
ret = ctx->ops->read_folio_range(iter, ctx, plen);
if (ret)
return ret;
+
*bytes_submitted += plen;
+ /*
+ * If the folio does not have ifs metadata attached,
+ * then after ->read_folio_range(), the folio might have
+ * gotten freed (eg iomap_finish_folio_read() ->
+ * folio_end_read() followed by page cache eviction,
+ * which for readahead folios drops the last refcount).
+ * Invalidate ctx->cur_folio here.
+ *
+ * For folios without ifs metadata attached, the read
+ * should be on the entire folio.
+ */
+ if (!ifs) {
+ ctx->cur_folio = NULL;
+ if (unlikely(plen != folio_len))
+ return -EIO;
+ }
}
ret = iomap_iter_advance(iter, plen);
--
2.47.3
* Re: [PATCH v3 1/1] iomap: fix readahead folio access after folio_end_read()
2026-01-16 20:04 ` [PATCH v3 1/1] " Joanne Koong
@ 2026-01-21 7:49 ` Christoph Hellwig
2026-01-21 23:13 ` Joanne Koong
0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2026-01-21 7:49 UTC (permalink / raw)
To: Joanne Koong; +Cc: brauner, willy, djwong, hch, bfoster, linux-fsdevel
On Fri, Jan 16, 2026 at 12:04:27PM -0800, Joanne Koong wrote:
> If the folio does not have an iomap_folio_state struct attached to it
> and the folio gets read in by the filesystem's IO helper,
> folio_end_read() may have already been called on the folio.
Not just can, but it has to, as there is no other way to track when
folio_end_read would need to be called.
> Fix this by invalidating ctx->cur_folio when a folio without
> iomap_folio_state metadata attached to it has been handed to the
> filesystem's IO helper.
Fix what?
For read_folio nothing ever looks at cur_folio, so I guess that is not
what is being fixed, and it's readahead in some form. Can you explain
what went wrong and how it is being fixed here a bit better?
> *bytes_submitted += plen;
> + /*
> + * If the folio does not have ifs metadata attached,
> + * then after ->read_folio_range(), the folio might have
> + * gotten freed (eg iomap_finish_folio_read() ->
> + * folio_end_read() followed by page cache eviction,
> + * which for readahead folios drops the last refcount).
> + * Invalidate ctx->cur_folio here.
> + *
> + * For folios without ifs metadata attached, the read
> + * should be on the entire folio.
> + */
> + if (!ifs) {
> + ctx->cur_folio = NULL;
> + if (unlikely(plen != folio_len))
> + return -EIO;
> + }
I think the sanity check here is an impossible-to-hit condition and
should probably be a WARN_ON_ONCE?
So to answer the above, I guess the issue is that for readahead, the
if (ctx->cur_folio &&
offset_in_folio(ctx->cur_folio, iter->pos) == 0) {
in iomap_readahead_iter does a double completion?
I don't really love how this spreads the cur_folio setup logic from
iomap_readahead_iter to here, but the other options I can think of to
pass that logic to iomap_readahead_iter (overload the return value with
a positive return, extra output argument) don't feel all that enticing
either because they'd now have to spread the ifs logic into
iomap_readahead_iter. So I guess this version it is for now.
Maybe in the future I'll look into moving all the cur_folio logic
from iomap_readahead_iter into iomap_read_folio_iter and see if
that improves things, but that should not get into the way of the
bug fix.
* Re: [PATCH v3 1/1] iomap: fix readahead folio access after folio_end_read()
2026-01-21 7:49 ` Christoph Hellwig
@ 2026-01-21 23:13 ` Joanne Koong
2026-01-22 6:22 ` Christoph Hellwig
2026-01-22 16:51 ` Matthew Wilcox
0 siblings, 2 replies; 9+ messages in thread
From: Joanne Koong @ 2026-01-21 23:13 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: brauner, willy, djwong, bfoster, linux-fsdevel
On Tue, Jan 20, 2026 at 11:49 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Fri, Jan 16, 2026 at 12:04:27PM -0800, Joanne Koong wrote:
> > If the folio does not have an iomap_folio_state struct attached to it
> > and the folio gets read in by the filesystem's IO helper,
> > folio_end_read() may have already been called on the folio.
>
> Not just can, but it has to, as there is no other way to track when
> folio_end_read would need to be called.
>
> > Fix this by invalidating ctx->cur_folio when a folio without
> > iomap_folio_state metadata attached to it has been handed to the
> > filesystem's IO helper.
>
> Fix what?
>
> for read_folio nothing ever looks at cur_folio, so I guess that is not
> what is being fixed, and it's readahead in some form. Can you explain
> what went wrong and how it is being fixed here a bit better?
Thanks for taking a look at this. The problem I'm trying to fix is
this readahead scenario:
* folio with no ifs gets read in by the filesystem through the
->read_folio_range() call
* the filesystem reads in the folio and calls
iomap_finish_folio_read() which calls folio_end_read(), which unlocks
the folio
* then the page cache evicts the folio and drops the last refcount on
the folio which frees the folio and the folio gets repurposed by
another filesystem (eg btrfs) which uses folio->private
* the iomap logic accesses ctx->cur_folio still, and in the call to
iomap_read_end(), it'll detect a non-null folio->private and it'll
assume that's the ifs and it'll try to do stuff like
spin_lock_irq(&ifs->state_lock) which will crash the system.
This is not a problem for folios with an ifs because the +1 bias we
add to ifs->read_bytes_pending makes it so that iomap is the one who
invokes folio_end_read() when it's all done with the folio.
>
> > *bytes_submitted += plen;
> > + /*
> > + * If the folio does not have ifs metadata attached,
> > + * then after ->read_folio_range(), the folio might have
> > + * gotten freed (eg iomap_finish_folio_read() ->
> > + * folio_end_read() followed by page cache eviction,
> > + * which for readahead folios drops the last refcount).
> > + * Invalidate ctx->cur_folio here.
> > + *
> > + * For folios without ifs metadata attached, the read
> > + * should be on the entire folio.
> > + */
> > + if (!ifs) {
> > + ctx->cur_folio = NULL;
> > + if (unlikely(plen != folio_len))
> > + return -EIO;
> > + }
>
> I think the sanity check here is an impossible to hit condition and
> should probably be a WARN_ON_ONCE?
I'll be removing this check for the next version.
>
> So to answer the above, I guess the issue that for readahead, the
>
> if (ctx->cur_folio &&
> offset_in_folio(ctx->cur_folio, iter->pos) == 0) {
>
> in iomap_readahead_iter does a double completion?
>
> I don't really love how this spreads the cur_folio setup logic from
> iomap_readahead_iter to here, but the other options I can think off to
> pass that logic to iomap_readahead_iter (overload return value with a
> positive return, extra output argument) don't feel all that enticing
> either because they'd now have to spread the ifs logic into
> iomap_readahead_iter. So I guess this version it is for now.
Looking at this some more, I think we'll need to use ctx->cur_folio
for non-readahead reads as well (eg passing ctx->cur_folio to
iomap_read_end() in iomap_read_folio()). As Matthew pointed out to me
in [1], we still can't access folio->private after folio_end_read()
even if we hold an active refcount on the folio.
Thanks,
Joanne
[1] https://lore.kernel.org/linux-fsdevel/aWmn2FympQXOMst-@casper.infradead.org/
>
> Maybe in the future I'll look into moving all the cur_folio logic
> from iomap_readahead_iter into iomap_read_folio_iter and see if
> that improves things, but that should not get into the way of the
> bug fix.
>
* Re: [PATCH v3 1/1] iomap: fix readahead folio access after folio_end_read()
2026-01-21 23:13 ` Joanne Koong
@ 2026-01-22 6:22 ` Christoph Hellwig
2026-01-22 16:51 ` Matthew Wilcox
1 sibling, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2026-01-22 6:22 UTC (permalink / raw)
To: Joanne Koong
Cc: Christoph Hellwig, brauner, willy, djwong, bfoster, linux-fsdevel
On Wed, Jan 21, 2026 at 03:13:35PM -0800, Joanne Koong wrote:
> Thanks for taking a look at this. The problem I'm trying to fix is
> this readahead scenario:
> * folio with no ifs gets read in by the filesystem through the
> ->read_folio_range() call
> * the filesystem reads in the folio and calls
> iomap_finish_folio_read() which calls folio_end_read(), which unlocks
> the folio
> * then the page cache evicts the folio and drops the last refcount on
> the folio which frees the folio and the folio gets repurposed by
> another filesystem (eg btrfs) which uses folio->private
> * the iomap logic accesses ctx->cur_folio still, and in the call to
> iomap_read_end(), it'll detect a non-null folio->private and it'll
> assume that's the ifs and it'll try to do stuff like
> spin_lock_irq(&ifs->state_lock) which will crash the system.
Yeah, I think I gathered that from the code changes in the patch, but
it helps to state this. Or in short, this:
if (ctx->cur_folio &&
offset_in_folio(ctx->cur_folio, iter->pos) == 0) {
accesses a potentially freed folio.
> > > + if (unlikely(plen != folio_len))
> > > + return -EIO;
> > > + }
> >
> > I think the sanity check here is an impossible to hit condition and
> > should probably be a WARN_ON_ONCE?
>
> I'll be removing this check for the next version.
I think it's good to have the check; what I meant is to add a
WARN_ON_ONCE to make it stick out as a debug check.
> Looking at this some more, I think we'll need to use ctx->cur_folio
> for non-readahead reads as well (eg passing ctx->cur_folio to
> iomap_read_end() in iomap_read_folio()). As Matthew pointed out to me
> in [1], we still can't access folio->private after folio_end_read()
> even if we hold an active refcount on the folio.
Yeah.
* Re: [PATCH v3 1/1] iomap: fix readahead folio access after folio_end_read()
2026-01-21 23:13 ` Joanne Koong
2026-01-22 6:22 ` Christoph Hellwig
@ 2026-01-22 16:51 ` Matthew Wilcox
2026-01-22 19:50 ` Joanne Koong
1 sibling, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2026-01-22 16:51 UTC (permalink / raw)
To: Joanne Koong; +Cc: Christoph Hellwig, brauner, djwong, bfoster, linux-fsdevel
On Wed, Jan 21, 2026 at 03:13:35PM -0800, Joanne Koong wrote:
> Thanks for taking a look at this. The problem I'm trying to fix is
> this readahead scenario:
> * folio with no ifs gets read in by the filesystem through the
> ->read_folio_range() call
> * the filesystem reads in the folio and calls
> iomap_finish_folio_read() which calls folio_end_read(), which unlocks
> the folio
> * then the page cache evicts the folio and drops the last refcount on
> the folio which frees the folio and the folio gets repurposed by
> another filesystem (eg btrfs) which uses folio->private
> * the iomap logic accesses ctx->cur_folio still, and in the call to
> iomap_read_end(), it'll detect a non-null folio->private and it'll
> assume that's the ifs and it'll try to do stuff like
> spin_lock_irq(&ifs->state_lock) which will crash the system.
>
> This is not a problem for folios with an ifs because the +1 bias we
> add to ifs->read_bytes_pending makes it so that iomap is the one who
> invokes folio_end_read() when it's all done with the folio.
This is so complicated. I think you made your life harder by adding the
bias to read_bytes_pending. What if we just rip most of this out ...
(the key here is to call iomap_finish_folio_read() instead of
iomap_set_range_uptodate() when we zero parts of the folio)
I've thrown this at my test VM. See how it ends up doing.
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 154456e39fe5..c2ff4f561617 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -381,7 +381,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
ifs_alloc(iter->inode, folio, iter->flags);
folio_fill_tail(folio, offset, iomap->inline_data, size);
- iomap_set_range_uptodate(folio, offset, folio_size(folio) - offset);
+ iomap_finish_folio_read(folio, offset, folio_size(folio) - offset, 0);
return 0;
}
@@ -418,92 +418,19 @@ static void iomap_read_init(struct folio *folio)
struct iomap_folio_state *ifs = folio->private;
if (ifs) {
- size_t len = folio_size(folio);
-
/*
- * ifs->read_bytes_pending is used to track how many bytes are
- * read in asynchronously by the IO helper. We need to track
- * this so that we can know when the IO helper has finished
- * reading in all the necessary ranges of the folio and can end
- * the read.
- *
- * Increase ->read_bytes_pending by the folio size to start, and
- * add a +1 bias. We'll subtract the bias and any uptodate /
- * zeroed ranges that did not require IO in iomap_read_end()
- * after we're done processing the folio.
- *
- * We do this because otherwise, we would have to increment
- * ifs->read_bytes_pending every time a range in the folio needs
- * to be read in, which can get expensive since the spinlock
- * needs to be held whenever modifying ifs->read_bytes_pending.
- *
- * We add the bias to ensure the read has not been ended on the
- * folio when iomap_read_end() is called, even if the IO helper
- * has already finished reading in the entire folio.
+ * Initially all bytes in the folio are pending.
+ * We subtract as either reads complete or we decide
+ * to memset(). Once the count reaches zero, the read
+ * is complete.
*/
spin_lock_irq(&ifs->state_lock);
WARN_ON_ONCE(ifs->read_bytes_pending != 0);
- ifs->read_bytes_pending = len + 1;
+ ifs->read_bytes_pending = folio_size(folio);
spin_unlock_irq(&ifs->state_lock);
}
}
-/*
- * This ends IO if no bytes were submitted to an IO helper.
- *
- * Otherwise, this calibrates ifs->read_bytes_pending to represent only the
- * submitted bytes (see comment in iomap_read_init()). If all bytes submitted
- * have already been completed by the IO helper, then this will end the read.
- * Else the IO helper will end the read after all submitted ranges have been
- * read.
- */
-static void iomap_read_end(struct folio *folio, size_t bytes_submitted)
-{
- struct iomap_folio_state *ifs = folio->private;
-
- if (ifs) {
- bool end_read, uptodate;
-
- spin_lock_irq(&ifs->state_lock);
- if (!ifs->read_bytes_pending) {
- WARN_ON_ONCE(bytes_submitted);
- spin_unlock_irq(&ifs->state_lock);
- folio_unlock(folio);
- return;
- }
-
- /*
- * Subtract any bytes that were initially accounted to
- * read_bytes_pending but skipped for IO. The +1 accounts for
- * the bias we added in iomap_read_init().
- */
- ifs->read_bytes_pending -=
- (folio_size(folio) + 1 - bytes_submitted);
-
- /*
- * If !ifs->read_bytes_pending, this means all pending reads by
- * the IO helper have already completed, which means we need to
- * end the folio read here. If ifs->read_bytes_pending != 0,
- * the IO helper will end the folio read.
- */
- end_read = !ifs->read_bytes_pending;
- if (end_read)
- uptodate = ifs_is_fully_uptodate(folio, ifs);
- spin_unlock_irq(&ifs->state_lock);
- if (end_read)
- folio_end_read(folio, uptodate);
- } else if (!bytes_submitted) {
- /*
- * If there were no bytes submitted, this means we are
- * responsible for unlocking the folio here, since no IO helper
- * has taken ownership of it. If there were bytes submitted,
- * then the IO helper will end the read via
- * iomap_finish_folio_read().
- */
- folio_unlock(folio);
- }
-}
-
static int iomap_read_folio_iter(struct iomap_iter *iter,
struct iomap_read_folio_ctx *ctx, size_t *bytes_submitted)
{
@@ -544,7 +471,7 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
/* zero post-eof blocks as the page may be mapped */
if (iomap_block_needs_zeroing(iter, pos)) {
folio_zero_range(folio, poff, plen);
- iomap_set_range_uptodate(folio, poff, plen);
+ iomap_finish_folio_read(folio, poff, plen, 0);
} else {
if (!*bytes_submitted)
iomap_read_init(folio);
@@ -588,8 +515,6 @@ void iomap_read_folio(const struct iomap_ops *ops,
if (ctx->ops->submit_read)
ctx->ops->submit_read(ctx);
-
- iomap_read_end(folio, bytes_submitted);
}
EXPORT_SYMBOL_GPL(iomap_read_folio);
@@ -601,7 +526,6 @@ static int iomap_readahead_iter(struct iomap_iter *iter,
while (iomap_length(iter)) {
if (ctx->cur_folio &&
offset_in_folio(ctx->cur_folio, iter->pos) == 0) {
- iomap_read_end(ctx->cur_folio, *cur_bytes_submitted);
ctx->cur_folio = NULL;
}
if (!ctx->cur_folio) {
@@ -653,9 +577,6 @@ void iomap_readahead(const struct iomap_ops *ops,
if (ctx->ops->submit_read)
ctx->ops->submit_read(ctx);
-
- if (ctx->cur_folio)
- iomap_read_end(ctx->cur_folio, cur_bytes_submitted);
}
EXPORT_SYMBOL_GPL(iomap_readahead);
* Re: [PATCH v3 1/1] iomap: fix readahead folio access after folio_end_read()
2026-01-22 16:51 ` Matthew Wilcox
@ 2026-01-22 19:50 ` Joanne Koong
2026-01-22 19:56 ` Matthew Wilcox
0 siblings, 1 reply; 9+ messages in thread
From: Joanne Koong @ 2026-01-22 19:50 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Christoph Hellwig, brauner, djwong, bfoster, linux-fsdevel
On Thu, Jan 22, 2026 at 8:51 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Jan 21, 2026 at 03:13:35PM -0800, Joanne Koong wrote:
> > Thanks for taking a look at this. The problem I'm trying to fix is
> > this readahead scenario:
> > * folio with no ifs gets read in by the filesystem through the
> > ->read_folio_range() call
> > * the filesystem reads in the folio and calls
> > iomap_finish_folio_read() which calls folio_end_read(), which unlocks
> > the folio
> > * then the page cache evicts the folio and drops the last refcount on
> > the folio which frees the folio and the folio gets repurposed by
> > another filesystem (eg btrfs) which uses folio->private
> > * the iomap logic accesses ctx->cur_folio still, and in the call to
> > iomap_read_end(), it'll detect a non-null folio->private and it'll
> > assume that's the ifs and it'll try to do stuff like
> > spin_lock_irq(&ifs->state_lock) which will crash the system.
> >
> > This is not a problem for folios with an ifs because the +1 bias we
> > add to ifs->read_bytes_pending makes it so that iomap is the one who
> > invokes folio_end_read() when it's all done with the folio.
>
> This is so complicated. I think you made your life harder by adding the
> bias to read_bytes_pending. What if we just rip most of this out ...
I don't think we can rip this out because when the read starts,
ifs->read_bytes_pending gets set to the folio size, but if there are
already some uptodate ranges in the folio, the filesystem IO helper
will not be reading in the entire folio size, which means we still
need all this logic to decrement ifs->read_bytes_pending by the bytes
not read in by the IO helper and to end the folio read.
Thanks,
Joanne
* Re: [PATCH v3 1/1] iomap: fix readahead folio access after folio_end_read()
2026-01-22 19:50 ` Joanne Koong
@ 2026-01-22 19:56 ` Matthew Wilcox
2026-01-22 23:05 ` Joanne Koong
0 siblings, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2026-01-22 19:56 UTC (permalink / raw)
To: Joanne Koong; +Cc: Christoph Hellwig, brauner, djwong, bfoster, linux-fsdevel
On Thu, Jan 22, 2026 at 11:50:18AM -0800, Joanne Koong wrote:
> On Thu, Jan 22, 2026 at 8:51 AM Matthew Wilcox <willy@infradead.org> wrote:
> > This is so complicated. I think you made your life harder by adding the
> > bias to read_bytes_pending. What if we just rip most of this out ...
>
> I don't think we can rip this out because when the read starts,
> ifs->read_bytes_pending gets set to the folio size, but if there are
> already some uptodate ranges in the folio, the filesystem IO helper
> will not be reading in the entire folio size, which means we still
> need all this logic to decrement ifs->read_bytes_pending by the bytes
> not read in by the IO helper and to end the folio read.
Well, the patch as-is doesn't work (squinting at it now to see
what I missed ...), but that's not an insurmountable problem.
If we find already-uptodate blocks in the folio, we can just call
iomap_finish_folio_read() for them without starting I/O. That's very
much a corner case, so we need not try hard for efficiency (eg skipping
the ifs_set_range_uptodate() call).
* Re: [PATCH v3 1/1] iomap: fix readahead folio access after folio_end_read()
2026-01-22 19:56 ` Matthew Wilcox
@ 2026-01-22 23:05 ` Joanne Koong
0 siblings, 0 replies; 9+ messages in thread
From: Joanne Koong @ 2026-01-22 23:05 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Christoph Hellwig, brauner, djwong, bfoster, linux-fsdevel
On Thu, Jan 22, 2026 at 11:56 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Jan 22, 2026 at 11:50:18AM -0800, Joanne Koong wrote:
> > On Thu, Jan 22, 2026 at 8:51 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > This is so complicated. I think you made your life harder by adding the
> > > bias to read_bytes_pending. What if we just rip most of this out ...
> >
> > I don't think we can rip this out because when the read starts,
> > ifs->read_bytes_pending gets set to the folio size, but if there are
> > already some uptodate ranges in the folio, the filesystem IO helper
> > will not be reading in the entire folio size, which means we still
> > need all this logic to decrement ifs->read_bytes_pending by the bytes
> > not read in by the IO helper and to end the folio read.
>
> Well, the patch as-is doesn't work (squinting at it now to see
> what I missed ...), but that's not an insurmountable problem.
> If we find already-uptodate blocks in the folio, we can just call
> iomap_finish_folio_read() for them without starting I/O. That's very
> much a corner case, so we need not try hard for efficiency (eg skipping
> the ifs_set_range_uptodate() call).
I tried coding this out but I think it ends up looking just as
unappealing. To cover all the edge cases, I think it'd have to look
something like this:
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6beb876658c0..d01b760ee3fe 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -409,8 +409,6 @@ static void iomap_read_init(struct folio *folio)
struct iomap_folio_state *ifs = folio->private;
if (ifs) {
- size_t len = folio_size(folio);
-
/*
* ifs->read_bytes_pending is used to track how many bytes are
* read in asynchronously by the IO helper. We need to track
@@ -418,80 +416,20 @@ static void iomap_read_init(struct folio *folio)
* reading in all the necessary ranges of the folio and can end
* the read.
*
- * Increase ->read_bytes_pending by the folio size to start, and
- * add a +1 bias. We'll subtract the bias and any uptodate /
- * zeroed ranges that did not require IO in iomap_read_end()
- * after we're done processing the folio.
+ * Increase ->read_bytes_pending by the folio size to start.
+ * We'll subtract any uptodate / zeroed ranges that did not
+ * require IO in iomap_finish_folio_read() after we're done
+ * processing the folio.
*
* We do this because otherwise, we would have to increment
* ifs->read_bytes_pending every time a range in the folio needs
* to be read in, which can get expensive since the spinlock
* needs to be held whenever modifying ifs->read_bytes_pending.
- *
- * We add the bias to ensure the read has not been ended on the
- * folio when iomap_read_end() is called, even if the IO helper
- * has already finished reading in the entire folio.
*/
spin_lock_irq(&ifs->state_lock);
WARN_ON_ONCE(ifs->read_bytes_pending != 0);
- ifs->read_bytes_pending = len + 1;
- spin_unlock_irq(&ifs->state_lock);
- }
-}
-
-/*
- * This ends IO if no bytes were submitted to an IO helper.
- *
- * Otherwise, this calibrates ifs->read_bytes_pending to represent only the
- * submitted bytes (see comment in iomap_read_init()). If all bytes submitted
- * have already been completed by the IO helper, then this will end the read.
- * Else the IO helper will end the read after all submitted ranges have been
- * read.
- */
-static void iomap_read_end(struct folio *folio, size_t bytes_submitted)
-{
- struct iomap_folio_state *ifs = folio->private;
-
- if (ifs) {
- bool end_read, uptodate;
-
- spin_lock_irq(&ifs->state_lock);
- if (!ifs->read_bytes_pending) {
- WARN_ON_ONCE(bytes_submitted);
- spin_unlock_irq(&ifs->state_lock);
- folio_unlock(folio);
- return;
- }
-
- /*
- * Subtract any bytes that were initially accounted to
- * read_bytes_pending but skipped for IO. The +1 accounts for
- * the bias we added in iomap_read_init().
- */
- ifs->read_bytes_pending -=
- (folio_size(folio) + 1 - bytes_submitted);
-
- /*
- * If !ifs->read_bytes_pending, this means all pending reads by
- * the IO helper have already completed, which means we need to
- * end the folio read here. If ifs->read_bytes_pending != 0,
- * the IO helper will end the folio read.
- */
- end_read = !ifs->read_bytes_pending;
- if (end_read)
- uptodate = ifs_is_fully_uptodate(folio, ifs);
+ ifs->read_bytes_pending = folio_size(folio);
spin_unlock_irq(&ifs->state_lock);
- if (end_read)
- folio_end_read(folio, uptodate);
- } else if (!bytes_submitted) {
- /*
- * If there were no bytes submitted, this means we are
- * responsible for unlocking the folio here, since no IO helper
- * has taken ownership of it. If there were bytes submitted,
- * then the IO helper will end the read via
- * iomap_finish_folio_read().
- */
- folio_unlock(folio);
}
}
@@ -502,10 +440,16 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
loff_t pos = iter->pos;
loff_t length = iomap_length(iter);
struct folio *folio = ctx->cur_folio;
+ size_t folio_len = folio_size(folio);
size_t poff, plen;
loff_t pos_diff;
int ret;
+ ifs_alloc(iter->inode, folio, iter->flags);
+
+ if (!*bytes_submitted)
+ iomap_read_init(folio);
+
if (iomap->type == IOMAP_INLINE) {
ret = iomap_read_inline_data(iter, folio);
if (ret)
@@ -513,8 +457,6 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
return iomap_iter_advance(iter, length);
}
- ifs_alloc(iter->inode, folio, iter->flags);
-
length = min_t(loff_t, length,
folio_size(folio) - offset_in_folio(folio, pos));
while (length) {
@@ -537,12 +479,12 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
folio_zero_range(folio, poff, plen);
iomap_set_range_uptodate(folio, poff, plen);
} else {
- if (!*bytes_submitted)
- iomap_read_init(folio);
ret = ctx->ops->read_folio_range(iter, ctx, plen);
if (ret)
return ret;
*bytes_submitted += plen;
+ if (*bytes_submitted == folio_len)
+			ctx->cur_folio = NULL;
}
ret = iomap_iter_advance(iter, plen);
@@ -558,36 +500,51 @@ void iomap_read_folio(const struct iomap_ops *ops,
struct iomap_read_folio_ctx *ctx)
{
struct folio *folio = ctx->cur_folio;
+ size_t folio_len = folio_size(folio);
struct iomap_iter iter = {
.inode = folio->mapping->host,
.pos = folio_pos(folio),
- .len = folio_size(folio),
+ .len = folio_len,
};
size_t bytes_submitted = 0;
+ bool read_initialized = false;
int ret;
trace_iomap_readpage(iter.inode, 1);
- while ((ret = iomap_iter(&iter, ops)) > 0)
+ while ((ret = iomap_iter(&iter, ops)) > 0) {
iter.status = iomap_read_folio_iter(&iter, ctx,
&bytes_submitted);
+ read_initialized = true;
+ }
if (ctx->ops->submit_read)
ctx->ops->submit_read(ctx);
- iomap_read_end(folio, bytes_submitted);
+ if (ctx->cur_folio) {
+ if (read_initialized)
+ iomap_finish_folio_read(folio, 0,
+ folio_len - bytes_submitted,
+ iter.status);
+ else
+ folio_unlock(folio);
+ }
}
EXPORT_SYMBOL_GPL(iomap_read_folio);
static int iomap_readahead_iter(struct iomap_iter *iter,
- struct iomap_read_folio_ctx *ctx, size_t *cur_bytes_submitted)
+ struct iomap_read_folio_ctx *ctx, size_t *cur_bytes_submitted,
+ bool *cur_read_initialized)
{
int ret;
while (iomap_length(iter)) {
if (ctx->cur_folio &&
offset_in_folio(ctx->cur_folio, iter->pos) == 0) {
- iomap_read_end(ctx->cur_folio, *cur_bytes_submitted);
+ if (*cur_read_initialized)
+ iomap_finish_folio_read(ctx->cur_folio, 0,
+					folio_size(ctx->cur_folio) - *cur_bytes_submitted,
+					0);
ctx->cur_folio = NULL;
}
if (!ctx->cur_folio) {
@@ -595,8 +552,10 @@ static int iomap_readahead_iter(struct iomap_iter *iter,
if (WARN_ON_ONCE(!ctx->cur_folio))
return -EINVAL;
*cur_bytes_submitted = 0;
+ *cur_read_initialized = false;
}
ret = iomap_read_folio_iter(iter, ctx, cur_bytes_submitted);
+ *cur_read_initialized = true;
if (ret)
return ret;
}
@@ -629,18 +588,26 @@ void iomap_readahead(const struct iomap_ops *ops,
.len = readahead_length(rac),
};
size_t cur_bytes_submitted;
+ bool cur_read_initialized;
trace_iomap_readahead(rac->mapping->host, readahead_count(rac));
while (iomap_iter(&iter, ops) > 0)
iter.status = iomap_readahead_iter(&iter, ctx,
- &cur_bytes_submitted);
+ &cur_bytes_submitted,
+						&cur_read_initialized);
if (ctx->ops->submit_read)
ctx->ops->submit_read(ctx);
- if (ctx->cur_folio)
- iomap_read_end(ctx->cur_folio, cur_bytes_submitted);
+ if (ctx->cur_folio) {
+ if (cur_read_initialized) {
+ iomap_finish_folio_read(ctx->cur_folio, 0,
+					folio_size(ctx->cur_folio) - cur_bytes_submitted,
+ iter.status);
+ } else {
+ folio_unlock(ctx->cur_folio);
+ }
+ }
}
EXPORT_SYMBOL_GPL(iomap_readahead);
In particular, we need to move the ifs_alloc() and iomap_read_init()
calls to before the iomap_read_inline_data() call, and track
read_initialized as a separate variable passed through
iomap_read_folio() and iomap_readahead(). This accounts for the case
where an ifs is allocated but iomap_read_init() has not been called on
the folio (which can happen if the ifs was allocated by a prior write
to a block in the folio).
iomap_read_end() guarded against this with:
	if (ifs) {
		...
		spin_lock_irq(&ifs->state_lock);
		if (!ifs->read_bytes_pending) {
			WARN_ON_ONCE(bytes_submitted);
			spin_unlock_irq(&ifs->state_lock);
			folio_unlock(folio);
			return;
		}
	}
In the existing implementation, if we invalidated ctx->cur_folio
whenever the entire folio is read in, we could drop the "+1"
addition/subtraction. But I think what you're objecting to in general
is having a separate iomap_read_end() at all, instead of repurposing
the logic in iomap_finish_folio_read().
Thanks,
Joanne
Thread overview: 9+ messages
2026-01-16 20:04 [PATCH v3 0/1] iomap: fix readahead folio access after folio_end_read() Joanne Koong
2026-01-16 20:04 ` [PATCH v3 1/1] " Joanne Koong
2026-01-21 7:49 ` Christoph Hellwig
2026-01-21 23:13 ` Joanne Koong
2026-01-22 6:22 ` Christoph Hellwig
2026-01-22 16:51 ` Matthew Wilcox
2026-01-22 19:50 ` Joanne Koong
2026-01-22 19:56 ` Matthew Wilcox
2026-01-22 23:05 ` Joanne Koong