public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/1] iomap: fix readahead folio refcounting race
@ 2026-01-16  1:54 Joanne Koong
  2026-01-16  1:54 ` [PATCH v2 1/1] " Joanne Koong
  0 siblings, 1 reply; 12+ messages in thread
From: Joanne Koong @ 2026-01-16  1:54 UTC (permalink / raw)
  To: brauner; +Cc: willy, djwong, hch, bfoster, linux-fsdevel

This patch is on top of Christian's vfs.fixes tree.

This fixes a race for readahead folios that was introduced by commit
b2f35ac4146d ("iomap: add caller-provided callbacks for read and readahead").

The change in the patch was sanity-tested on fuse and xfs (with large folios
disabled) by running generic/112 for both readahead and read_folio (simulated
by commenting out the ->readahead() callback).

Changelog
---------
v1: https://lore.kernel.org/linux-fsdevel/20260114180255.3043081-1-joannelkoong@gmail.com/
* Invalidate ctx->cur_folio instead of retaining readahead caller refcount (Matthew)

Joanne Koong (1):
  iomap: fix readahead folio refcounting race

 fs/iomap/buffered-io.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

-- 
2.47.3


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/1] iomap: fix readahead folio refcounting race
  2026-01-16  1:54 [PATCH v2 0/1] iomap: fix readahead folio refcounting race Joanne Koong
@ 2026-01-16  1:54 ` Joanne Koong
  2026-01-16  2:52   ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Joanne Koong @ 2026-01-16  1:54 UTC (permalink / raw)
  To: brauner; +Cc: willy, djwong, hch, bfoster, linux-fsdevel

readahead_folio() returns the next folio from the readahead control
(rac) but it also drops the refcount on the folio that had been held by
the rac. As such, there is only one refcount remaining on the folio
(which is held by the page cache) after this returns.

This is problematic because this opens a race where if the folio does
not have an iomap_folio_state struct attached to it and the folio gets
read in by the filesystem's IO helper, folio_end_read() may have already
been called on the folio (which will unlock the folio) which allows the
page cache to evict the folio (dropping the refcount and leading to the
folio being freed), which leads to use-after-free issues when
subsequent logic in iomap_readahead_iter() or iomap_read_end() accesses
that folio.

Fix this by invalidating ctx->cur_folio when a folio without
iomap_folio_state metadata attached to it has been handed to the
filesystem's IO helper.

Fixes: b2f35ac4146d ("iomap: add caller-provided callbacks for read and readahead")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 fs/iomap/buffered-io.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6beb876658c0..2243399d70b5 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -502,6 +502,8 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
 	loff_t pos = iter->pos;
 	loff_t length = iomap_length(iter);
 	struct folio *folio = ctx->cur_folio;
+	size_t folio_len = folio_size(folio);
+	struct iomap_folio_state *ifs;
 	size_t poff, plen;
 	loff_t pos_diff;
 	int ret;
@@ -513,10 +515,10 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
 		return iomap_iter_advance(iter, length);
 	}
 
-	ifs_alloc(iter->inode, folio, iter->flags);
+	ifs = ifs_alloc(iter->inode, folio, iter->flags);
 
 	length = min_t(loff_t, length,
-			folio_size(folio) - offset_in_folio(folio, pos));
+			folio_len - offset_in_folio(folio, pos));
 	while (length) {
 		iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff,
 				&plen);
@@ -542,7 +544,24 @@ static int iomap_read_folio_iter(struct iomap_iter *iter,
 			ret = ctx->ops->read_folio_range(iter, ctx, plen);
 			if (ret)
 				return ret;
+
 			*bytes_submitted += plen;
+			/*
+			 * If the folio does not have ifs metadata attached,
+			 * then after ->read_folio_range(), the folio might have
+			 * gotten freed (eg iomap_finish_folio_read() ->
+			 * folio_end_read() followed by page cache eviction,
+			 * which for readahead folios drops the last refcount).
+			 * Invalidate ctx->cur_folio here.
+			 *
+			 * For folios without ifs metadata attached, the read
+			 * should be on the entire folio.
+			 */
+			if (!ifs) {
+				ctx->cur_folio = NULL;
+				if (unlikely(plen != folio_len))
+				    return -EIO;
+			}
 		}
 
 		ret = iomap_iter_advance(iter, plen);
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/1] iomap: fix readahead folio refcounting race
  2026-01-16  1:54 ` [PATCH v2 1/1] " Joanne Koong
@ 2026-01-16  2:52   ` Matthew Wilcox
  2026-01-16 18:36     ` Joanne Koong
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2026-01-16  2:52 UTC (permalink / raw)
  To: Joanne Koong; +Cc: brauner, djwong, hch, bfoster, linux-fsdevel

On Thu, Jan 15, 2026 at 05:54:52PM -0800, Joanne Koong wrote:
> readahead_folio() returns the next folio from the readahead control
> (rac) but it also drops the refcount on the folio that had been held by
> the rac. As such, there is only one refcount remaining on the folio
> (which is held by the page cache) after this returns.
> 
> This is problematic because this opens a race where if the folio does
> not have an iomap_folio_state struct attached to it and the folio gets
> read in by the filesystem's IO helper, folio_end_read() may have already
> been called on the folio (which will unlock the folio) which allows the
> page cache to evict the folio (dropping the refcount and leading to the
> folio being freed), which leads to use-after-free issues when
> subsequent logic in iomap_readahead_iter() or iomap_read_end() accesses
> that folio.

This explanation is overly complex to the point of being misleading.
If it reflects your current thinking (as opposed to being copied over
from the previous version), it explains why you're having trouble.

The rule is simple.  Once you call folio_end_read(), the folio is
not yours any more.  You can't touch it again; you can't call
folio_size(), you can't call folio_end_read() again.

This discourse about refcounts and descriptions of how the page cache
currently behaves is unnecessary and confusing; it's something that's
going to change in the future and it's not relevant to filesystem authors.

So let's write something simple:

If the folio does not have an iomap_folio_state struct attached to it and
the folio gets read in by the filesystem's IO helper, folio_end_read()
may have already been called on the folio

Fix this by invalidating ctx->cur_folio when a folio without
iomap_folio_state metadata attached to it has been handed to the
filesystem's IO helper.

Fixes: b2f35ac4146d ("iomap: add caller-provided callbacks for read and readahead")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>

> +			if (!ifs) {
> +				ctx->cur_folio = NULL;
> +				if (unlikely(plen != folio_len))
> +				    return -EIO;

This should be indented with a tab, not four spaces.  Can it even
happen?  If we didn't attach an ifs, can we do a short read?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/1] iomap: fix readahead folio refcounting race
  2026-01-16  2:52   ` Matthew Wilcox
@ 2026-01-16 18:36     ` Joanne Koong
  2026-01-16 21:45       ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Joanne Koong @ 2026-01-16 18:36 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: brauner, djwong, hch, bfoster, linux-fsdevel

On Thu, Jan 15, 2026 at 6:52 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Jan 15, 2026 at 05:54:52PM -0800, Joanne Koong wrote:
> > readahead_folio() returns the next folio from the readahead control
> > (rac) but it also drops the refcount on the folio that had been held by
> > the rac. As such, there is only one refcount remaining on the folio
> > (which is held by the page cache) after this returns.
> >
> > This is problematic because this opens a race where if the folio does
> > not have an iomap_folio_state struct attached to it and the folio gets
> > read in by the filesystem's IO helper, folio_end_read() may have already
> > been called on the folio (which will unlock the folio) which allows the
> > page cache to evict the folio (dropping the refcount and leading to the
> > folio being freed), which leads to use-after-free issues when
> > subsequent logic in iomap_readahead_iter() or iomap_read_end() accesses
> > that folio.
>
> This explanation is overly complex to the point of being misleading.
> If it reflects your current thinking (as opposed to being copied over
> from the previous version), it explains why you're having trouble.
>
> The rule is simple.  Once you call folio_end_read(), the folio is
> not yours any more.  You can't touch it again; you can't call
> folio_size(), you can't call folio_end_read() again.

Oh I see, I was under the assumption the folio can still be accessed
afterwards so long as you hold a refcount on it. Thanks for clearing
this up.

>
> This discourse about refcounts and descriptions of how the page cache
> currently behaves is unnecessary and confusing; it's something that's
> going to change in the future and it's not relevant to filesystem authors.
>
> So let's write something simple:
>
> If the folio does not have an iomap_folio_state struct attached to it and
> the folio gets read in by the filesystem's IO helper, folio_end_read()
> may have already been called on the folio
>
> Fix this by invalidating ctx->cur_folio when a folio without
> iomap_folio_state metadata attached to it has been handed to the
> filesystem's IO helper.
>
> Fixes: b2f35ac4146d ("iomap: add caller-provided callbacks for read and readahead")
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>

I'll send v3 with this commit message.

>
> > +                     if (!ifs) {
> > +                             ctx->cur_folio = NULL;
> > +                             if (unlikely(plen != folio_len))
> > +                                 return -EIO;
>
> This should be indented with a tab, not four spaces.  Can it even
> happen?  If we didn't attach an ifs, can we do a short read?

The short read can happen if the filesystem sets the iomap length to a
size that's less than the folio size. plen is determined by
iomap_length() (which returns the minimum of the iter->len and the
iomap length value the filesystem set).

Thanks,
Joanne

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/1] iomap: fix readahead folio refcounting race
  2026-01-16 18:36     ` Joanne Koong
@ 2026-01-16 21:45       ` Matthew Wilcox
  2026-01-16 22:02         ` Joanne Koong
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2026-01-16 21:45 UTC (permalink / raw)
  To: Joanne Koong; +Cc: brauner, djwong, hch, bfoster, linux-fsdevel

On Fri, Jan 16, 2026 at 10:36:25AM -0800, Joanne Koong wrote:
> On Thu, Jan 15, 2026 at 6:52 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > > +                     if (!ifs) {
> > > +                             ctx->cur_folio = NULL;
> > > +                             if (unlikely(plen != folio_len))
> > > +                                 return -EIO;
> >
> > This should be indented with a tab, not four spaces.  Can it even
> > happen?  If we didn't attach an ifs, can we do a short read?
> 
> The short read can happen if the filesystem sets the iomap length to a
> size that's less than the folio size. plen is determined by
> iomap_length() (which returns the minimum of the iter->len and the
> iomap length value the filesystem set).

Understood, but if plen is less than folio_size(), don't we allocate
an ifs?  So !ifs && plen < folio_size() shouldn't be possible?  Or have
I misunderstood this part?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/1] iomap: fix readahead folio refcounting race
  2026-01-16 21:45       ` Matthew Wilcox
@ 2026-01-16 22:02         ` Joanne Koong
  2026-01-17  2:30           ` Darrick J. Wong
  0 siblings, 1 reply; 12+ messages in thread
From: Joanne Koong @ 2026-01-16 22:02 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: brauner, djwong, hch, bfoster, linux-fsdevel

On Fri, Jan 16, 2026 at 1:45 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jan 16, 2026 at 10:36:25AM -0800, Joanne Koong wrote:
> > On Thu, Jan 15, 2026 at 6:52 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > > +                     if (!ifs) {
> > > > +                             ctx->cur_folio = NULL;
> > > > +                             if (unlikely(plen != folio_len))
> > > > +                                 return -EIO;
> > >
> > > This should be indented with a tab, not four spaces.  Can it even
> > > happen?  If we didn't attach an ifs, can we do a short read?
> >
> > The short read can happen if the filesystem sets the iomap length to a
> > size that's less than the folio size. plen is determined by
> > iomap_length() (which returns the minimum of the iter->len and the
> > iomap length value the filesystem set).
>
> Understood, but if plen is less than folio_size(), don't we allocate
> an ifs?  So !ifs && plen < folio_size() shouldn't be possible?  Or have
> I misunderstood this part?

Maybe I'm misunderstanding this, but I'm pretty sure the ifs is only
allocated if the inode's block size is less than the folio size, and
is not based on plen. The logic I'm looking at is the code inside
ifs_alloc().

Thanks,
Joanne

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/1] iomap: fix readahead folio refcounting race
  2026-01-16 22:02         ` Joanne Koong
@ 2026-01-17  2:30           ` Darrick J. Wong
  2026-01-21  0:34             ` Joanne Koong
  0 siblings, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2026-01-17  2:30 UTC (permalink / raw)
  To: Joanne Koong; +Cc: Matthew Wilcox, brauner, hch, bfoster, linux-fsdevel

On Fri, Jan 16, 2026 at 02:02:20PM -0800, Joanne Koong wrote:
> On Fri, Jan 16, 2026 at 1:45 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Jan 16, 2026 at 10:36:25AM -0800, Joanne Koong wrote:
> > > On Thu, Jan 15, 2026 at 6:52 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > > +                     if (!ifs) {
> > > > > +                             ctx->cur_folio = NULL;
> > > > > +                             if (unlikely(plen != folio_len))
> > > > > +                                 return -EIO;
> > > >
> > > > This should be indented with a tab, not four spaces.  Can it even
> > > > happen?  If we didn't attach an ifs, can we do a short read?
> > >
> > > The short read can happen if the filesystem sets the iomap length to a
> > > size that's less than the folio size. plen is determined by
> > > iomap_length() (which returns the minimum of the iter->len and the
> > > iomap length value the filesystem set).
> >
> > Understood, but if plen is less than folio_size(), don't we allocate
> > an ifs?  So !ifs && plen < folio_size() shouldn't be possible?  Or have
> > I misunderstood this part?
> 
> Maybe I'm misunderstanding this, but I'm pretty sure the ifs is only
> allocated if the inode's block size is less than the folio size, and
> is not based on plen. The logic I'm looking at is the code inside
> ifs_alloc().

Hrmm.  If there's no ifs then blocksize == foliosize, so if
plen < foliosize then that means we're not fully reading in the folio
contents?  That doesn't sound good, especially if the folio can be
mmapped into someone's address space.

--D

> Thanks,
> Joanne
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/1] iomap: fix readahead folio refcounting race
  2026-01-17  2:30           ` Darrick J. Wong
@ 2026-01-21  0:34             ` Joanne Koong
  2026-01-21  1:07               ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Joanne Koong @ 2026-01-21  0:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Matthew Wilcox, brauner, hch, bfoster, linux-fsdevel

On Fri, Jan 16, 2026 at 6:30 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Fri, Jan 16, 2026 at 02:02:20PM -0800, Joanne Koong wrote:
> > On Fri, Jan 16, 2026 at 1:45 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Fri, Jan 16, 2026 at 10:36:25AM -0800, Joanne Koong wrote:
> > > > On Thu, Jan 15, 2026 at 6:52 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > > +                     if (!ifs) {
> > > > > > +                             ctx->cur_folio = NULL;
> > > > > > +                             if (unlikely(plen != folio_len))
> > > > > > +                                 return -EIO;
> > > > >
> > > > > This should be indented with a tab, not four spaces.  Can it even
> > > > > happen?  If we didn't attach an ifs, can we do a short read?
> > > >
> > > > The short read can happen if the filesystem sets the iomap length to a
> > > > size that's less than the folio size. plen is determined by
> > > > iomap_length() (which returns the minimum of the iter->len and the
> > > > iomap length value the filesystem set).
> > >
> > > Understood, but if plen is less than folio_size(), don't we allocate
> > > an ifs?  So !ifs && plen < folio_size() shouldn't be possible?  Or have
> > > I misunderstood this part?
> >
> > Maybe I'm misunderstanding this, but I'm pretty sure the ifs is only
> > allocated if the inode's block size is less than the folio size, and
> > is not based on plen. The logic I'm looking at is the code inside
> > ifs_alloc().
>
> Hrmm.  If there's no ifs then blocksize == foliosize, so if
> plen < foliosize then that means we're not fully reading in the folio
> contents?  That doesn't sound good, especially if the folio can be
> mmapped into someone's address space.

I think the caller is the one who sets the lower bound on how much of
the folio gets read in (since plen is min-bounded by
iter->iomap.length). The calling filesystem shouldn't be setting the
iomap.length to less than the folio size to do iterative partial reads
if there's no ifs, so I added this "!ifs && plen < folio_size()" check
as a guard against this, as I think the bug is subtle to detect
otherwise.

But looking at some of the caller implementations, I think my above
implementation is wrong. At least one caller (zonefs, erofs) relies on
iterative partial reads for zeroing parts of the folio (eg setting
next iomap iteration on the folio as IOMAP_HOLE), which is fine since
reads using bios end the read at bio submission time (which happens at
->submit_read()). But fuse ends the read at either
->read_folio_range() or ->submit_read() time. So I think the caller
needs to specify whether it ends the read at ->read_folio_range() or
not, and only then can we invalidate ctx->cur_folio. I'll submit v4
with this change.

Thanks,
Joanne

>
> --D
>
> > Thanks,
> > Joanne
> >

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/1] iomap: fix readahead folio refcounting race
  2026-01-21  0:34             ` Joanne Koong
@ 2026-01-21  1:07               ` Matthew Wilcox
  2026-01-21  4:12                 ` Joanne Koong
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2026-01-21  1:07 UTC (permalink / raw)
  To: Joanne Koong; +Cc: Darrick J. Wong, brauner, hch, bfoster, linux-fsdevel

On Tue, Jan 20, 2026 at 04:34:22PM -0800, Joanne Koong wrote:
> But looking at some of the caller implementations, I think my above
> implementation is wrong. At least one caller (zonefs, erofs) relies on
> iterative partial reads for zeroing parts of the folio (eg setting
> next iomap iteration on the folio as IOMAP_HOLE), which is fine since
> reads using bios end the read at bio submission time (which happens at
> ->submit_read()). But fuse ends the read at either
> ->read_folio_range() or ->submit_read() time. So I think the caller
> needs to specify whether it ends the read at ->read_folio_range() or
> not, and only then can we invalidate ctx->cur_folio. I'll submit v4
> with this change.

... but it can only do that on a block size boundary!  Which means that
if the block size is smaller than the folio size, we'll allocate an ifs.
If the block size is equal to the folio size, we won't allocate an IFS,
but neither will the length be less than the folio size ... so the return
of -EIO was dead code, like I said.  Right?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/1] iomap: fix readahead folio refcounting race
  2026-01-21  1:07               ` Matthew Wilcox
@ 2026-01-21  4:12                 ` Joanne Koong
  2026-01-21  4:42                   ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Joanne Koong @ 2026-01-21  4:12 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Darrick J. Wong, brauner, hch, bfoster, linux-fsdevel

On Tue, Jan 20, 2026 at 5:07 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jan 20, 2026 at 04:34:22PM -0800, Joanne Koong wrote:
> > But looking at some of the caller implementations, I think my above
> > implementation is wrong. At least one caller (zonefs, erofs) relies on
> > iterative partial reads for zeroing parts of the folio (eg setting
> > next iomap iteration on the folio as IOMAP_HOLE), which is fine since
> > reads using bios end the read at bio submission time (which happens at
> > ->submit_read()). But fuse ends the read at either
> > ->read_folio_range() or ->submit_read() time. So I think the caller
> > needs to specify whether it ends the read at ->read_folio_range() or
> > not, and only then can we invalidate ctx->cur_folio. I'll submit v4
> > with this change.
>
> ... but it can only do that on a block size boundary!  Which means that
> if the block size is smaller than the folio size, we'll allocate an ifs.
> If the block size is equal to the folio size, we won't allocate an IFS,
> but neither will the length be less than the folio size ... so the return
> of -EIO was dead code, like I said.  Right?

Maybe I'm totally misreading this then, but can't the file size be
non-block-aligned even if the filesystem is block-based, which means
"iomap->length = i_size_read(inode) - iomap->offset" (as in
zonefs_read_iomap_begin()) isn't guaranteed to always be a
block-aligned mapping length (eg leading to the case where plen <
folio_size and block_size == folio_size)? I see for direct io writes
that the write size is enforced to be block-aligned (in
zonefs_file_dio_write()) and seq files must go through direct io, but
I don't see that this applies to buffered writes for non-seq files,
which I think means inode->i_size can be non-block-aligned.

Thanks,
Joanne

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/1] iomap: fix readahead folio refcounting race
  2026-01-21  4:12                 ` Joanne Koong
@ 2026-01-21  4:42                   ` Matthew Wilcox
  2026-01-21  5:51                     ` Joanne Koong
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2026-01-21  4:42 UTC (permalink / raw)
  To: Joanne Koong; +Cc: Darrick J. Wong, brauner, hch, bfoster, linux-fsdevel

On Tue, Jan 20, 2026 at 08:12:10PM -0800, Joanne Koong wrote:
> On Tue, Jan 20, 2026 at 5:07 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Jan 20, 2026 at 04:34:22PM -0800, Joanne Koong wrote:
> > > But looking at some of the caller implementations, I think my above
> > > implementation is wrong. At least one caller (zonefs, erofs) relies on
> > > iterative partial reads for zeroing parts of the folio (eg setting
> > > next iomap iteration on the folio as IOMAP_HOLE), which is fine since
> > > reads using bios end the read at bio submission time (which happens at
> > > ->submit_read()). But fuse ends the read at either
> > > ->read_folio_range() or ->submit_read() time. So I think the caller
> > > needs to specify whether it ends the read at ->read_folio_range() or
> > > not, and only then can we invalidate ctx->cur_folio. I'll submit v4
> > > with this change.
> >
> > ... but it can only do that on a block size boundary!  Which means that
> > if the block size is smaller than the folio size, we'll allocate an ifs.
> > If the block size is equal to the folio size, we won't allocate an IFS,
> > but neither will the length be less than the folio size ... so the return
> > of -EIO was dead code, like I said.  Right?
> 
> Maybe I'm totally misreading this then, but can't the file size be
> non-block-aligned even if the filesystem is block-based, which means
> "iomap->length = i_size_read(inode) - iomap->offset" (as in
> zonefs_read_iomap_begin()) isn't guaranteed to always be a
> block-aligned mapping length (eg leading to the case where plen <
> folio_size and block_size == folio_size)? I see for direct io writes
> that the write size is enforced to be block-aligned (in
> zonefs_file_dio_write()) and seq files must go through direct io, but
> I don't see that this applies to buffered writes for non-seq files,
> which I think means inode->i_size can be non-block-aligned.

I think the important thing is that a block device can only do I/O in
units of block size!  Let's work an example.

Lets say we're on a 4KiB block device and have a 4KiB folio.  The file in
question is a mere 700 bytes in size.  Regardless of what the filesystem
asks iomap for (700 bytes or 4096 bytes), the BIO that goes to the
device instructs it to read one 4KiB block.  Once the read has completed,
all bytes in the folio are now uptodate and the completion handler can
call folio_end_read().

If we were on a 512 byte block device, we'd allocate an IFS, do an I/O
for 2 * 512 byte blocks and zero the remaining 3KiB with memset().
Whichever of the memset() or read-completion happened later calls
folio_end_read().

(some filesystems zero post-EOF bytes after the DMA has completed;
I believe ntfs3 does, for example.  But I don't think XFS or iomap does
that; it relies on post-EOF bytes being zeroed in the writeback path)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/1] iomap: fix readahead folio refcounting race
  2026-01-21  4:42                   ` Matthew Wilcox
@ 2026-01-21  5:51                     ` Joanne Koong
  0 siblings, 0 replies; 12+ messages in thread
From: Joanne Koong @ 2026-01-21  5:51 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Darrick J. Wong, brauner, hch, bfoster, linux-fsdevel

On Tue, Jan 20, 2026 at 8:42 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jan 20, 2026 at 08:12:10PM -0800, Joanne Koong wrote:
> > On Tue, Jan 20, 2026 at 5:07 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Tue, Jan 20, 2026 at 04:34:22PM -0800, Joanne Koong wrote:
> > > > But looking at some of the caller implementations, I think my above
> > > > implementation is wrong. At least one caller (zonefs, erofs) relies on
> > > > iterative partial reads for zeroing parts of the folio (eg setting
> > > > next iomap iteration on the folio as IOMAP_HOLE), which is fine since
> > > > reads using bios end the read at bio submission time (which happens at
> > > > ->submit_read()). But fuse ends the read at either
> > > > ->read_folio_range() or ->submit_read() time. So I think the caller
> > > > needs to specify whether it ends the read at ->read_folio_range() or
> > > > not, and only then can we invalidate ctx->cur_folio. I'll submit v4
> > > > with this change.
> > >
> > > ... but it can only do that on a block size boundary!  Which means that
> > > if the block size is smaller than the folio size, we'll allocate an ifs.
> > > If the block size is equal to the folio size, we won't allocate an IFS,
> > > but neither will the length be less than the folio size ... so the return
> > > of -EIO was dead code, like I said.  Right?
> >
> > Maybe I'm totally misreading this then, but can't the file size be
> > non-block-aligned even if the filesystem is block-based, which means
> > "iomap->length = i_size_read(inode) - iomap->offset" (as in
> > zonefs_read_iomap_begin()) isn't guaranteed to always be a
> > block-aligned mapping length (eg leading to the case where plen <
> > folio_size and block_size == folio_size)? I see for direct io writes
> > that the write size is enforced to be block-aligned (in
> > zonefs_file_dio_write()) and seq files must go through direct io, but
> > I don't see that this applies to buffered writes for non-seq files,
> > which I think means inode->i_size can be non-block-aligned.
>
> I think the important thing is that a block device can only do I/O in
> units of block size!  Let's work an example.
>
> Lets say we're on a 4KiB block device and have a 4KiB folio.  The file in
> question is a mere 700 bytes in size.  Regardless of what the filesystem
> asks iomap for (700 bytes or 4096 bytes), the BIO that goes to the
> device instructs it to read one 4KiB block.  Once the read has completed,
> all bytes in the folio are now uptodate and the completion handler can
> call folio_end_read().
>
> If we were on a 512 byte block device, we'd allocate an IFS, do an I/O
> for 2 * 512 byte blocks and zero the remaining 3KiB with memset().
> Whichever of the memset() or read-completion happened later calls
> folio_end_read().
>
> (some filesystems zero post-EOF bytes after the DMA has completed;
> I believe ntfs3 does, for example.  But I don't think XFS or iomap does
> that; it relies on post-EOF bytes being zeroed in the writeback path)

I see, thanks for your patience. Basically what you are saying is that
it never makes sense for a filesystem to set the iomap->length value
in ->iomap_begin() to something that's less than the block size
because all I/O submissions to the bio layer have block-aligned
lengths. That makes sense to me.

(btw, I realize where I went wrong with the zonefs analysis above -
the zonefs doc [1] says for non-seq (conventional) files "The size of
conventional zone files is fixed to the size of the zone that they
represent. Conventional zone files cannot be truncated". which means
the i_size_read() value for zonefs is always block-aligned).

You were right about the if check being unnecessary, I'll drop it.

Thanks,
Joanne

[1] https://zonedstorage.io/docs/filesystems/zonefs

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2026-01-21  5:51 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-01-16  1:54 [PATCH v2 0/1] iomap: fix readahead folio refcounting race Joanne Koong
2026-01-16  1:54 ` [PATCH v2 1/1] " Joanne Koong
2026-01-16  2:52   ` Matthew Wilcox
2026-01-16 18:36     ` Joanne Koong
2026-01-16 21:45       ` Matthew Wilcox
2026-01-16 22:02         ` Joanne Koong
2026-01-17  2:30           ` Darrick J. Wong
2026-01-21  0:34             ` Joanne Koong
2026-01-21  1:07               ` Matthew Wilcox
2026-01-21  4:12                 ` Joanne Koong
2026-01-21  4:42                   ` Matthew Wilcox
2026-01-21  5:51                     ` Joanne Koong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox