[PATCH] iomap: fix race when reading in all bytes of a folio

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] iomap: fix race when reading in all bytes of a folio
@ 2025-10-24 21:50 Joanne Koong
  2025-10-27  7:02 ` Christoph Hellwig
  2025-10-27 12:20 ` Brian Foster
  0 siblings, 2 replies; 7+ messages in thread
From: Joanne Koong @ 2025-10-24 21:50 UTC (permalink / raw)
  To: brauner; +Cc: bfoster, hch, djwong, linux-fsdevel

There is a race where if all bytes in a folio need to get read in and
the filesystem finishes reading the bytes in before the call to
iomap_read_end(), then bytes_accounted in iomap_read_end() will be 0 and
the following "ifs->read_bytes_pending -= bytes_accounting" will also be
0 which will trigger an extra folio_end_read() call. This extra
folio_end_read() unlocks the folio for the 2nd time, which sets the lock
bit on the folio, resulting in a permanent lockup.

Fix this by returning from iomap_read_end() early if all bytes are read
in by the filesystem.

Additionally, add some comments to clarify how this accounting logic works.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Fixes: 51311f045375 ("iomap: track pending read bytes more optimally")
Reported-by: Brian Foster <bfoster@redhat.com>
--
This is a fix for commit 51311f045375 in the 'vfs-6.19.iomap' branch. It
would be great if this could get folded up into that original commit, if it's
not too logistically messy to do so.

Thanks,
Joanne
---
 fs/iomap/buffered-io.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 72196e5021b1..c31d30643e2d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -358,6 +358,25 @@ static void iomap_read_init(struct folio *folio)
 	if (ifs) {
 		size_t len = folio_size(folio);
 
+		/*
+		 * ifs->read_bytes_pending is used to track how many bytes are
+		 * read in asynchronously by the filesystem. We need to track
+		 * this so that we can know when the filesystem has finished
+		 * reading in the folio whereupon folio_end_read() should be
+		 * called.
+		 *
+		 * We first set ifs->read_bytes_pending to the entire folio
+		 * size. Then we track how many bytes are read in by the
+		 * filesystem. At the end, in iomap_read_end(), we subtract
+		 * ifs->read_bytes_pending by the number of bytes NOT read in so
+		 * that ifs->read_bytes_pending will be 0 when the filesystem
+		 * has finished reading in all pending bytes.
+		 *
+		 * ifs->read_bytes_pending is initialized to the folio size
+		 * because we do not easily know in the beginning how many
+		 * bytes need to get read in by the filesystem (eg some ranges
+		 * may already be uptodate).
+		 */
 		spin_lock_irq(&ifs->state_lock);
 		ifs->read_bytes_pending += len;
 		spin_unlock_irq(&ifs->state_lock);
@@ -383,6 +402,9 @@ static void iomap_read_end(struct folio *folio, size_t bytes_pending)
 		bool end_read, uptodate;
 		size_t bytes_accounted = folio_size(folio) - bytes_pending;
 
+		if (!bytes_accounted)
+			return;
+
 		spin_lock_irq(&ifs->state_lock);
 		ifs->read_bytes_pending -= bytes_accounted;
 		/*
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] iomap: fix race when reading in all bytes of a folio
  2025-10-24 21:50 [PATCH] iomap: fix race when reading in all bytes of a folio Joanne Koong
@ 2025-10-27  7:02 ` Christoph Hellwig
  2025-10-27 12:20 ` Brian Foster
  1 sibling, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2025-10-27  7:02 UTC (permalink / raw)
  To: Joanne Koong; +Cc: brauner, bfoster, hch, djwong, linux-fsdevel

On Fri, Oct 24, 2025 at 02:50:08PM -0700, Joanne Koong wrote:
> There is a race where if all bytes in a folio need to get read in and
> the filesystem finishes reading the bytes in before the call to
> iomap_read_end(), then bytes_accounted in iomap_read_end() will be 0 and
> the following "ifs->read_bytes_pending -= bytes_accounting" will also be
> 0 which will trigger an extra folio_end_read() call. This extra
> folio_end_read() unlocks the folio for the 2nd time, which sets the lock
> bit on the folio, resulting in a permanent lockup.
> 
> Fix this by returning from iomap_read_end() early if all bytes are read
> in by the filesystem.
> 
> Additionally, add some comments to clarify how this accounting logic works.
> 
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> Fixes: 51311f045375 ("iomap: track pending read bytes more optimally")
> Reported-by: Brian Foster <bfoster@redhat.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

> This is a fix for commit 51311f045375 in the 'vfs-6.19.iomap' branch. It
> would be great if this could get folded up into that original commit, if it's
> not too logistically messy to do so.

Agreed!


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] iomap: fix race when reading in all bytes of a folio
  2025-10-24 21:50 [PATCH] iomap: fix race when reading in all bytes of a folio Joanne Koong
  2025-10-27  7:02 ` Christoph Hellwig
@ 2025-10-27 12:20 ` Brian Foster
  2025-10-27 16:43   ` Joanne Koong
  1 sibling, 1 reply; 7+ messages in thread
From: Brian Foster @ 2025-10-27 12:20 UTC (permalink / raw)
  To: Joanne Koong; +Cc: brauner, hch, djwong, linux-fsdevel

On Fri, Oct 24, 2025 at 02:50:08PM -0700, Joanne Koong wrote:
> There is a race where if all bytes in a folio need to get read in and
> the filesystem finishes reading the bytes in before the call to
> iomap_read_end(), then bytes_accounted in iomap_read_end() will be 0 and
> the following "ifs->read_bytes_pending -= bytes_accounting" will also be
> 0 which will trigger an extra folio_end_read() call. This extra
> folio_end_read() unlocks the folio for the 2nd time, which sets the lock
> bit on the folio, resulting in a permanent lockup.
> 
> Fix this by returning from iomap_read_end() early if all bytes are read
> in by the filesystem.
> 
> Additionally, add some comments to clarify how this accounting logic works.
> 
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> Fixes: 51311f045375 ("iomap: track pending read bytes more optimally")
> Reported-by: Brian Foster <bfoster@redhat.com>
> --
> This is a fix for commit 51311f045375 in the 'vfs-6.19.iomap' branch. It
> would be great if this could get folded up into that original commit, if it's
> not too logistically messy to do so.
> 
> Thanks,
> Joanne
> ---
>  fs/iomap/buffered-io.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 72196e5021b1..c31d30643e2d 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -358,6 +358,25 @@ static void iomap_read_init(struct folio *folio)
>  	if (ifs) {
>  		size_t len = folio_size(folio);
>  
> +		/*
> +		 * ifs->read_bytes_pending is used to track how many bytes are
> +		 * read in asynchronously by the filesystem. We need to track
> +		 * this so that we can know when the filesystem has finished
> +		 * reading in the folio whereupon folio_end_read() should be
> +		 * called.
> +		 *
> +		 * We first set ifs->read_bytes_pending to the entire folio
> +		 * size. Then we track how many bytes are read in by the
> +		 * filesystem. At the end, in iomap_read_end(), we subtract
> +		 * ifs->read_bytes_pending by the number of bytes NOT read in so
> +		 * that ifs->read_bytes_pending will be 0 when the filesystem
> +		 * has finished reading in all pending bytes.
> +		 *
> +		 * ifs->read_bytes_pending is initialized to the folio size
> +		 * because we do not easily know in the beginning how many
> +		 * bytes need to get read in by the filesystem (eg some ranges
> +		 * may already be uptodate).
> +		 */

Hmm.. "we do this because we don't easily know how many bytes to read,"
but apparently that's how this worked before by bumping the count as
reads were submitted..? I'm not sure this is really telling much. I'd
suggest something like (and feel free to completely rework any of
this)..

"Increase ->read_bytes_pending by the folio size to start. We'll
subtract uptodate ranges that did not require I/O in iomap_read_end()
once we're done processing the read. We do this because <reasons>."

... where <reasons> explains to somebody who might look at this in a
month or year and wonder why we don't just bump read_bytes_pending as we
go.

>  		spin_lock_irq(&ifs->state_lock);
>  		ifs->read_bytes_pending += len;
>  		spin_unlock_irq(&ifs->state_lock);
> @@ -383,6 +402,9 @@ static void iomap_read_end(struct folio *folio, size_t bytes_pending)

This function could use a comment at the top to explain it's meant for
ending read submission (not necessarily I/O, since that appears to be
open coded in finish_folio_read()).

>  		bool end_read, uptodate;
>  		size_t bytes_accounted = folio_size(folio) - bytes_pending;
>  

"Subtract any bytes that were initially accounted against
read_bytes_pending but skipped for I/O. If zero, then the entire folio
was submitted and we're done. I/O completion handles the rest."

Also, maybe I'm missing something but the !bytes_accounted case means
the I/O owns the folio lock now, right? If so, is it safe to access the
folio from here (i.e. folio_size() above)?

Comments aside, this survives a bunch of iters of my original
reproducer, so seems Ok from that standpoint.

Brian

> +		if (!bytes_accounted)
> +			return;
> +
>  		spin_lock_irq(&ifs->state_lock);
>  		ifs->read_bytes_pending -= bytes_accounted;
>  		/*
> -- 
> 2.47.3
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] iomap: fix race when reading in all bytes of a folio
  2025-10-27 12:20 ` Brian Foster
@ 2025-10-27 16:43   ` Joanne Koong
  2025-10-28 11:20     ` Brian Foster
  0 siblings, 1 reply; 7+ messages in thread
From: Joanne Koong @ 2025-10-27 16:43 UTC (permalink / raw)
  To: Brian Foster; +Cc: brauner, hch, djwong, linux-fsdevel

On Mon, Oct 27, 2025 at 5:16 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Fri, Oct 24, 2025 at 02:50:08PM -0700, Joanne Koong wrote:
> > There is a race where if all bytes in a folio need to get read in and
> > the filesystem finishes reading the bytes in before the call to
> > iomap_read_end(), then bytes_accounted in iomap_read_end() will be 0 and
> > the following "ifs->read_bytes_pending -= bytes_accounting" will also be
> > 0 which will trigger an extra folio_end_read() call. This extra
> > folio_end_read() unlocks the folio for the 2nd time, which sets the lock
> > bit on the folio, resulting in a permanent lockup.
> >
> > Fix this by returning from iomap_read_end() early if all bytes are read
> > in by the filesystem.
> >
> > Additionally, add some comments to clarify how this accounting logic works.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > Fixes: 51311f045375 ("iomap: track pending read bytes more optimally")
> > Reported-by: Brian Foster <bfoster@redhat.com>
> > --
> > This is a fix for commit 51311f045375 in the 'vfs-6.19.iomap' branch. It
> > would be great if this could get folded up into that original commit, if it's
> > not too logistically messy to do so.
> >
> > Thanks,
> > Joanne
> > ---
> >  fs/iomap/buffered-io.c | 22 ++++++++++++++++++++++
> >  1 file changed, 22 insertions(+)
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 72196e5021b1..c31d30643e2d 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -358,6 +358,25 @@ static void iomap_read_init(struct folio *folio)
> >       if (ifs) {
> >               size_t len = folio_size(folio);
> >
> > +             /*
> > +              * ifs->read_bytes_pending is used to track how many bytes are
> > +              * read in asynchronously by the filesystem. We need to track
> > +              * this so that we can know when the filesystem has finished
> > +              * reading in the folio whereupon folio_end_read() should be
> > +              * called.
> > +              *
> > +              * We first set ifs->read_bytes_pending to the entire folio
> > +              * size. Then we track how many bytes are read in by the
> > +              * filesystem. At the end, in iomap_read_end(), we subtract
> > +              * ifs->read_bytes_pending by the number of bytes NOT read in so
> > +              * that ifs->read_bytes_pending will be 0 when the filesystem
> > +              * has finished reading in all pending bytes.
> > +              *
> > +              * ifs->read_bytes_pending is initialized to the folio size
> > +              * because we do not easily know in the beginning how many
> > +              * bytes need to get read in by the filesystem (eg some ranges
> > +              * may already be uptodate).
> > +              */
>
> Hmm.. "we do this because we don't easily know how many bytes to read,"
> but apparently that's how this worked before by bumping the count as
> reads were submitted..? I'm not sure this is really telling much. I'd
> suggest something like (and feel free to completely rework any of
> this)..

Ahh with that sentence I was trying to convey that we need to do this
because we don't easily know in the beginning how many bytes need to
get read in (eg if we knew we'll be reading in 2k bytes, then we could
just set that to 2k instead of the folio size, and skip all the
accounting stuff). I will get rid of this and replace it with your
suggestion.

>
> "Increase ->read_bytes_pending by the folio size to start. We'll
> subtract uptodate ranges that did not require I/O in iomap_read_end()
> once we're done processing the read. We do this because <reasons>."
>
> ... where <reasons> explains to somebody who might look at this in a
> month or year and wonder why we don't just bump read_bytes_pending as we
> go.

Sounds good, I will use this for v2.
>
> >               spin_lock_irq(&ifs->state_lock);
> >               ifs->read_bytes_pending += len;
> >               spin_unlock_irq(&ifs->state_lock);
> > @@ -383,6 +402,9 @@ static void iomap_read_end(struct folio *folio, size_t bytes_pending)
>
> This function could use a comment at the top to explain it's meant for
> ending read submission (not necessarily I/O, since that appears to be
> open coded in finish_folio_read()).
>
> >               bool end_read, uptodate;
> >               size_t bytes_accounted = folio_size(folio) - bytes_pending;
> >
>
> "Subtract any bytes that were initially accounted against
> read_bytes_pending but skipped for I/O. If zero, then the entire folio
> was submitted and we're done. I/O completion handles the rest."

Will add this in v2.

>
> Also, maybe I'm missing something but the !bytes_accounted case means
> the I/O owns the folio lock now, right? If so, is it safe to access the
> folio from here (i.e. folio_size() above)?

I believe it is because there's still a reference on the folio here
since this can only be reached from the ->read_iter() and
->readahead() paths, which prevents the folio from getting evicted
(which is what I think you're referring to?).

Thanks,
Joanne
>
> Comments aside, this survives a bunch of iters of my original
> reproducer, so seems Ok from that standpoint.
>
> Brian
>
> > +             if (!bytes_accounted)
> > +                     return;
> > +
> >               spin_lock_irq(&ifs->state_lock);
> >               ifs->read_bytes_pending -= bytes_accounted;
> >               /*
> > --
> > 2.47.3
> >
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] iomap: fix race when reading in all bytes of a folio
  2025-10-27 16:43   ` Joanne Koong
@ 2025-10-28 11:20     ` Brian Foster
  2025-10-28 17:11       ` Joanne Koong
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Foster @ 2025-10-28 11:20 UTC (permalink / raw)
  To: Joanne Koong; +Cc: brauner, hch, djwong, linux-fsdevel

On Mon, Oct 27, 2025 at 09:43:59AM -0700, Joanne Koong wrote:
> On Mon, Oct 27, 2025 at 5:16 AM Brian Foster <bfoster@redhat.com> wrote:
> >
> > On Fri, Oct 24, 2025 at 02:50:08PM -0700, Joanne Koong wrote:
> > > There is a race where if all bytes in a folio need to get read in and
> > > the filesystem finishes reading the bytes in before the call to
> > > iomap_read_end(), then bytes_accounted in iomap_read_end() will be 0 and
> > > the following "ifs->read_bytes_pending -= bytes_accounting" will also be
> > > 0 which will trigger an extra folio_end_read() call. This extra
> > > folio_end_read() unlocks the folio for the 2nd time, which sets the lock
> > > bit on the folio, resulting in a permanent lockup.
> > >
> > > Fix this by returning from iomap_read_end() early if all bytes are read
> > > in by the filesystem.
> > >
> > > Additionally, add some comments to clarify how this accounting logic works.
> > >
> > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > Fixes: 51311f045375 ("iomap: track pending read bytes more optimally")
> > > Reported-by: Brian Foster <bfoster@redhat.com>
> > > --
> > > This is a fix for commit 51311f045375 in the 'vfs-6.19.iomap' branch. It
> > > would be great if this could get folded up into that original commit, if it's
> > > not too logistically messy to do so.
> > >
> > > Thanks,
> > > Joanne
> > > ---
> > >  fs/iomap/buffered-io.c | 22 ++++++++++++++++++++++
> > >  1 file changed, 22 insertions(+)
> > >
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index 72196e5021b1..c31d30643e2d 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -358,6 +358,25 @@ static void iomap_read_init(struct folio *folio)
> > >       if (ifs) {
> > >               size_t len = folio_size(folio);
> > >
> > > +             /*
> > > +              * ifs->read_bytes_pending is used to track how many bytes are
> > > +              * read in asynchronously by the filesystem. We need to track
> > > +              * this so that we can know when the filesystem has finished
> > > +              * reading in the folio whereupon folio_end_read() should be
> > > +              * called.
> > > +              *
> > > +              * We first set ifs->read_bytes_pending to the entire folio
> > > +              * size. Then we track how many bytes are read in by the
> > > +              * filesystem. At the end, in iomap_read_end(), we subtract
> > > +              * ifs->read_bytes_pending by the number of bytes NOT read in so
> > > +              * that ifs->read_bytes_pending will be 0 when the filesystem
> > > +              * has finished reading in all pending bytes.
> > > +              *
> > > +              * ifs->read_bytes_pending is initialized to the folio size
> > > +              * because we do not easily know in the beginning how many
> > > +              * bytes need to get read in by the filesystem (eg some ranges
> > > +              * may already be uptodate).
> > > +              */
> >
> > Hmm.. "we do this because we don't easily know how many bytes to read,"
> > but apparently that's how this worked before by bumping the count as
> > reads were submitted..? I'm not sure this is really telling much. I'd
> > suggest something like (and feel free to completely rework any of
> > this)..
> 
> Ahh with that sentence I was trying to convey that we need to do this
> because we don't easily know in the beginning how many bytes need to
> get read in (eg if we knew we'll be reading in 2k bytes, then we could
> just set that to 2k instead of the folio size, and skip all the
> accounting stuff). I will get rid of this and replace it with your
> suggestion.
> 
> >
> > "Increase ->read_bytes_pending by the folio size to start. We'll
> > subtract uptodate ranges that did not require I/O in iomap_read_end()
> > once we're done processing the read. We do this because <reasons>."
> >
> > ... where <reasons> explains to somebody who might look at this in a
> > month or year and wonder why we don't just bump read_bytes_pending as we
> > go.
> 
> Sounds good, I will use this for v2.
> >
> > >               spin_lock_irq(&ifs->state_lock);
> > >               ifs->read_bytes_pending += len;
> > >               spin_unlock_irq(&ifs->state_lock);
> > > @@ -383,6 +402,9 @@ static void iomap_read_end(struct folio *folio, size_t bytes_pending)
> >
> > This function could use a comment at the top to explain it's meant for
> > ending read submission (not necessarily I/O, since that appears to be
> > open coded in finish_folio_read()).
> >
> > >               bool end_read, uptodate;
> > >               size_t bytes_accounted = folio_size(folio) - bytes_pending;
> > >
> >
> > "Subtract any bytes that were initially accounted against
> > read_bytes_pending but skipped for I/O. If zero, then the entire folio
> > was submitted and we're done. I/O completion handles the rest."
> 
> Will add this in v2.
> 
> >
> > Also, maybe I'm missing something but the !bytes_accounted case means
> > the I/O owns the folio lock now, right? If so, is it safe to access the
> > folio from here (i.e. folio_size() above)?
> 
> I believe it is because there's still a reference on the folio here
> since this can only be reached from the ->read_iter() and
> ->readahead() paths, which prevents the folio from getting evicted
> (which is what I think you're referring to?).
> 

Yep, but also whether the size or anything can change.. can a large
folio split, for example?

ISTM the proper thing to do would be to somehow ensure the read submit
side finishes (with the folio) before the completion side can complete
the I/O, then there is an explicit ownership boundary and these sorts of
racy folio access questions can go away. Just my 02.

Brian

> Thanks,
> Joanne
> >
> > Comments aside, this survives a bunch of iters of my original
> > reproducer, so seems Ok from that standpoint.
> >
> > Brian
> >
> > > +             if (!bytes_accounted)
> > > +                     return;
> > > +
> > >               spin_lock_irq(&ifs->state_lock);
> > >               ifs->read_bytes_pending -= bytes_accounted;
> > >               /*
> > > --
> > > 2.47.3
> > >
> >
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] iomap: fix race when reading in all bytes of a folio
  2025-10-28 11:20     ` Brian Foster
@ 2025-10-28 17:11       ` Joanne Koong
  2025-10-28 17:31         ` Joanne Koong
  0 siblings, 1 reply; 7+ messages in thread
From: Joanne Koong @ 2025-10-28 17:11 UTC (permalink / raw)
  To: Brian Foster; +Cc: brauner, hch, djwong, linux-fsdevel

On Tue, Oct 28, 2025 at 4:16 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Mon, Oct 27, 2025 at 09:43:59AM -0700, Joanne Koong wrote:
> > On Mon, Oct 27, 2025 at 5:16 AM Brian Foster <bfoster@redhat.com> wrote:
> > >
> > > On Fri, Oct 24, 2025 at 02:50:08PM -0700, Joanne Koong wrote:
> > > > There is a race where if all bytes in a folio need to get read in and
> > > > the filesystem finishes reading the bytes in before the call to
> > > > iomap_read_end(), then bytes_accounted in iomap_read_end() will be 0 and
> > > > the following "ifs->read_bytes_pending -= bytes_accounting" will also be
> > > > 0 which will trigger an extra folio_end_read() call. This extra
> > > > folio_end_read() unlocks the folio for the 2nd time, which sets the lock
> > > > bit on the folio, resulting in a permanent lockup.
> > > >
> > > > Fix this by returning from iomap_read_end() early if all bytes are read
> > > > in by the filesystem.
> > > >
> > > > Additionally, add some comments to clarify how this accounting logic works.
> > > >
> > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > Fixes: 51311f045375 ("iomap: track pending read bytes more optimally")
> > > > Reported-by: Brian Foster <bfoster@redhat.com>
> > > > --
> > > > This is a fix for commit 51311f045375 in the 'vfs-6.19.iomap' branch. It
> > > > would be great if this could get folded up into that original commit, if it's
> > > > not too logistically messy to do so.
> > > >
> > > > Thanks,
> > > > Joanne
> > > > ---
> > > >  fs/iomap/buffered-io.c | 22 ++++++++++++++++++++++
> > > >  1 file changed, 22 insertions(+)
> > > >
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index 72196e5021b1..c31d30643e2d 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -358,6 +358,25 @@ static void iomap_read_init(struct folio *folio)
> > > >       if (ifs) {
> > > >               size_t len = folio_size(folio);
> > > >
> > > > +             /*
> > > > +              * ifs->read_bytes_pending is used to track how many bytes are
> > > > +              * read in asynchronously by the filesystem. We need to track
> > > > +              * this so that we can know when the filesystem has finished
> > > > +              * reading in the folio whereupon folio_end_read() should be
> > > > +              * called.
> > > > +              *
> > > > +              * We first set ifs->read_bytes_pending to the entire folio
> > > > +              * size. Then we track how many bytes are read in by the
> > > > +              * filesystem. At the end, in iomap_read_end(), we subtract
> > > > +              * ifs->read_bytes_pending by the number of bytes NOT read in so
> > > > +              * that ifs->read_bytes_pending will be 0 when the filesystem
> > > > +              * has finished reading in all pending bytes.
> > > > +              *
> > > > +              * ifs->read_bytes_pending is initialized to the folio size
> > > > +              * because we do not easily know in the beginning how many
> > > > +              * bytes need to get read in by the filesystem (eg some ranges
> > > > +              * may already be uptodate).
> > > > +              */
> > >
> > > Hmm.. "we do this because we don't easily know how many bytes to read,"
> > > but apparently that's how this worked before by bumping the count as
> > > reads were submitted..? I'm not sure this is really telling much. I'd
> > > suggest something like (and feel free to completely rework any of
> > > this)..
> >
> > Ahh with that sentence I was trying to convey that we need to do this
> > because we don't easily know in the beginning how many bytes need to
> > get read in (eg if we knew we'll be reading in 2k bytes, then we could
> > just set that to 2k instead of the folio size, and skip all the
> > accounting stuff). I will get rid of this and replace it with your
> > suggestion.
> >
> > >
> > > "Increase ->read_bytes_pending by the folio size to start. We'll
> > > subtract uptodate ranges that did not require I/O in iomap_read_end()
> > > once we're done processing the read. We do this because <reasons>."
> > >
> > > ... where <reasons> explains to somebody who might look at this in a
> > > month or year and wonder why we don't just bump read_bytes_pending as we
> > > go.
> >
> > Sounds good, I will use this for v2.
> > >
> > > >               spin_lock_irq(&ifs->state_lock);
> > > >               ifs->read_bytes_pending += len;
> > > >               spin_unlock_irq(&ifs->state_lock);
> > > > @@ -383,6 +402,9 @@ static void iomap_read_end(struct folio *folio, size_t bytes_pending)
> > >
> > > This function could use a comment at the top to explain it's meant for
> > > ending read submission (not necessarily I/O, since that appears to be
> > > open coded in finish_folio_read()).
> > >
> > > >               bool end_read, uptodate;
> > > >               size_t bytes_accounted = folio_size(folio) - bytes_pending;
> > > >
> > >
> > > "Subtract any bytes that were initially accounted against
> > > read_bytes_pending but skipped for I/O. If zero, then the entire folio
> > > was submitted and we're done. I/O completion handles the rest."
> >
> > Will add this in v2.
> >
> > >
> > > Also, maybe I'm missing something but the !bytes_accounted case means
> > > the I/O owns the folio lock now, right? If so, is it safe to access the
> > > folio from here (i.e. folio_size() above)?
> >
> > I believe it is because there's still a reference on the folio here
> > since this can only be reached from the ->read_iter() and
> > ->readahead() paths, which prevents the folio from getting evicted
> > (which is what I think you're referring to?).
> >
>
> Yep, but also whether the size or anything can change.. can a large
> folio split, for example?
>
> ISTM the proper thing to do would be to somehow ensure the read submit
> side finishes (with the folio) before the completion side can complete
> the I/O, then there is an explicit ownership boundary and these sorts of
> racy folio access questions can go away. Just my 02.

Hmm, I don't think there's any way to ensure the read submit side
finishes before the completion side completes.
I think what we can do though is stash the size of the folio and just
use that. I'll make this change to v3 to be safe.

Thanks,
Joanne

>
> Brian
>
> > Thanks,
> > Joanne
> > >
> > > Comments aside, this survives a bunch of iters of my original
> > > reproducer, so seems Ok from that standpoint.
> > >
> > > Brian
> > >
> > > > +             if (!bytes_accounted)
> > > > +                     return;
> > > > +
> > > >               spin_lock_irq(&ifs->state_lock);
> > > >               ifs->read_bytes_pending -= bytes_accounted;
> > > >               /*
> > > > --
> > > > 2.47.3
> > > >
> > >
> >
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] iomap: fix race when reading in all bytes of a folio
  2025-10-28 17:11       ` Joanne Koong
@ 2025-10-28 17:31         ` Joanne Koong
  0 siblings, 0 replies; 7+ messages in thread
From: Joanne Koong @ 2025-10-28 17:31 UTC (permalink / raw)
  To: Brian Foster; +Cc: brauner, hch, djwong, linux-fsdevel

On Tue, Oct 28, 2025 at 10:11 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Tue, Oct 28, 2025 at 4:16 AM Brian Foster <bfoster@redhat.com> wrote:
> >
> > On Mon, Oct 27, 2025 at 09:43:59AM -0700, Joanne Koong wrote:
> > > On Mon, Oct 27, 2025 at 5:16 AM Brian Foster <bfoster@redhat.com> wrote:
> > > >
> > > > On Fri, Oct 24, 2025 at 02:50:08PM -0700, Joanne Koong wrote:
> > > > > There is a race where if all bytes in a folio need to get read in and
> > > > > the filesystem finishes reading the bytes in before the call to
> > > > > iomap_read_end(), then bytes_accounted in iomap_read_end() will be 0 and
> > > > > the following "ifs->read_bytes_pending -= bytes_accounting" will also be
> > > > > 0 which will trigger an extra folio_end_read() call. This extra
> > > > > folio_end_read() unlocks the folio for the 2nd time, which sets the lock
> > > > > bit on the folio, resulting in a permanent lockup.
> > > > >
> > > > > Fix this by returning from iomap_read_end() early if all bytes are read
> > > > > in by the filesystem.
> > > > >
> > > > > Additionally, add some comments to clarify how this accounting logic works.
> > > > >
> > > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > > Fixes: 51311f045375 ("iomap: track pending read bytes more optimally")
> > > > > Reported-by: Brian Foster <bfoster@redhat.com>
> > > > > --
> > > > > This is a fix for commit 51311f045375 in the 'vfs-6.19.iomap' branch. It
> > > > > would be great if this could get folded up into that original commit, if it's
> > > > > not too logistically messy to do so.
> > > > >
> > > > > Thanks,
> > > > > Joanne
> > > > > ---
> > > > >  fs/iomap/buffered-io.c | 22 ++++++++++++++++++++++
> > > > >  1 file changed, 22 insertions(+)
> > > > >
> > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > > index 72196e5021b1..c31d30643e2d 100644
> > > > > --- a/fs/iomap/buffered-io.c
> > > > > +++ b/fs/iomap/buffered-io.c
> > > > > @@ -358,6 +358,25 @@ static void iomap_read_init(struct folio *folio)
> > > > >       if (ifs) {
> > > > >               size_t len = folio_size(folio);
> > > > >
> > > > > +             /*
> > > > > +              * ifs->read_bytes_pending is used to track how many bytes are
> > > > > +              * read in asynchronously by the filesystem. We need to track
> > > > > +              * this so that we can know when the filesystem has finished
> > > > > +              * reading in the folio whereupon folio_end_read() should be
> > > > > +              * called.
> > > > > +              *
> > > > > +              * We first set ifs->read_bytes_pending to the entire folio
> > > > > +              * size. Then we track how many bytes are read in by the
> > > > > +              * filesystem. At the end, in iomap_read_end(), we subtract
> > > > > +              * ifs->read_bytes_pending by the number of bytes NOT read in so
> > > > > +              * that ifs->read_bytes_pending will be 0 when the filesystem
> > > > > +              * has finished reading in all pending bytes.
> > > > > +              *
> > > > > +              * ifs->read_bytes_pending is initialized to the folio size
> > > > > +              * because we do not easily know in the beginning how many
> > > > > +              * bytes need to get read in by the filesystem (eg some ranges
> > > > > +              * may already be uptodate).
> > > > > +              */
> > > >
> > > > Hmm.. "we do this because we don't easily know how many bytes to read,"
> > > > but apparently that's how this worked before by bumping the count as
> > > > reads were submitted..? I'm not sure this is really telling much. I'd
> > > > suggest something like (and feel free to completely rework any of
> > > > this)..
> > >
> > > Ahh with that sentence I was trying to convey that we need to do this
> > > because we don't easily know in the beginning how many bytes need to
> > > get read in (eg if we knew we'll be reading in 2k bytes, then we could
> > > just set that to 2k instead of the folio size, and skip all the
> > > accounting stuff). I will get rid of this and replace it with your
> > > suggestion.
> > >
> > > >
> > > > "Increase ->read_bytes_pending by the folio size to start. We'll
> > > > subtract uptodate ranges that did not require I/O in iomap_read_end()
> > > > once we're done processing the read. We do this because <reasons>."
> > > >
> > > > ... where <reasons> explains to somebody who might look at this in a
> > > > month or year and wonder why we don't just bump read_bytes_pending as we
> > > > go.
> > >
> > > Sounds good, I will use this for v2.
> > > >
> > > > >               spin_lock_irq(&ifs->state_lock);
> > > > >               ifs->read_bytes_pending += len;
> > > > >               spin_unlock_irq(&ifs->state_lock);
> > > > > @@ -383,6 +402,9 @@ static void iomap_read_end(struct folio *folio, size_t bytes_pending)
> > > >
> > > > This function could use a comment at the top to explain it's meant for
> > > > ending read submission (not necessarily I/O, since that appears to be
> > > > open coded in finish_folio_read()).
> > > >
> > > > >               bool end_read, uptodate;
> > > > >               size_t bytes_accounted = folio_size(folio) - bytes_pending;
> > > > >
> > > >
> > > > "Subtract any bytes that were initially accounted against
> > > > read_bytes_pending but skipped for I/O. If zero, then the entire folio
> > > > was submitted and we're done. I/O completion handles the rest."
> > >
> > > Will add this in v2.
> > >
> > > >
> > > > Also, maybe I'm missing something but the !bytes_accounted case means
> > > > the I/O owns the folio lock now, right? If so, is it safe to access the
> > > > folio from here (i.e. folio_size() above)?
> > >
> > > I believe it is because there's still a reference on the folio here
> > > since this can only be reached from the ->read_iter() and
> > > ->readahead() paths, which prevents the folio from getting evicted
> > > (which is what I think you're referring to?).
> > >
> >
> > Yep, but also whether the size or anything can change.. can a large
> > folio split, for example?
> >
> > ISTM the proper thing to do would be to somehow ensure the read submit
> > side finishes (with the folio) before the completion side can complete
> > the I/O, then there is an explicit ownership boundary and these sorts of
> > racy folio access questions can go away. Just my 02.
>
> Hmm, I don't think there's any way to ensure the read submit side
> finishes before the completion side completes.
> I think what we can do though is stash the size of the folio and just
> use that. I'll make this change to v3 to be safe.

Thinking about this some more, I think there is a simpler way to
ensure this, if we initialize ifs->read_bytes_pending to folio_size()
+ 1 where that + 1 bias ensures the folio won't have been unlocked by
the time we get to iomap_read_end().

I'll do it this way for v3, as stashing the folio size would mean
having to add it to the public iomap_read_folio_ctx struct.

Thanks,
Joanne

>
> Thanks,
> Joanne
>
> >
> > Brian
> >
> > > Thanks,
> > > Joanne
> > > >
> > > > Comments aside, this survives a bunch of iters of my original
> > > > reproducer, so seems Ok from that standpoint.
> > > >
> > > > Brian
> > > >
> > > > > +             if (!bytes_accounted)
> > > > > +                     return;
> > > > > +
> > > > >               spin_lock_irq(&ifs->state_lock);
> > > > >               ifs->read_bytes_pending -= bytes_accounted;
> > > > >               /*
> > > > > --
> > > > > 2.47.3
> > > > >
> > > >
> > >
> >

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-10-28 17:31 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-24 21:50 [PATCH] iomap: fix race when reading in all bytes of a folio Joanne Koong
2025-10-27  7:02 ` Christoph Hellwig
2025-10-27 12:20 ` Brian Foster
2025-10-27 16:43   ` Joanne Koong
2025-10-28 11:20     ` Brian Foster
2025-10-28 17:11       ` Joanne Koong
2025-10-28 17:31         ` Joanne Koong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).