[PATCH] btrfs: fix incorrect readahead expansion length

Linux Btrfs filesystem development
 help / color / mirror / Atom feed

* [PATCH] btrfs: fix incorrect readahead expansion length
@ 2025-10-01 16:50 Boris Burkov
  2025-10-01 17:10 ` Filipe Manana
  2025-10-03 11:02 ` Max Chernoff
  0 siblings, 2 replies; 5+ messages in thread
From: Boris Burkov @ 2025-10-01 16:50 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

The intent of btrfs_readahead_expand() was to expand to the length of
the current compressed extent being read. However, "ram_bytes" is *not*
that, in the case where a single physical compressed extent is used for
multiple file extents.

Consider this case with a large compressed extent C and then later two
non-compressed extents N1 and N2 written over C, leaving C1 and C2
pointing to offset/len pairs of C:
[               C                 ]
[ N1 ][     C1     ][ N2 ][   C2  ]

In such a case, ram_bytes for both C1 and C2 is the full uncompressed
length of C. So starting readahead in C1 will expand the readahead past
the end of C1, past N2, and into C2. This will then expand readahead
again, to C2_start + ram_bytes, way past EOF. First of all, this is
totally undesirable, we don't want to read the whole file in arbitrary
chunks of the large underlying extent if it happens to exist. Secondly,
it results in zeroing the range past the end of C2 up to ram_bytes. This
is particularly unpleasant with fs-verity as it can zero and set
uptodate pages in the verity virtual space past EOF. This incorrect
readahead behavior can lead to verity verification errors, if we iterate
in a way that happens to do the wrong readahead.

Fix this by using em->len for readahead expansion, not em->ram_bytes,
resulting in the expected behavior of stopping readahead at the extent
boundary.

Reported-by: Max Chernoff <git@maxchernoff.ca>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2399898
Fixes: 9e9ff875e417 ("btrfs: use readahead_expand() on compressed extents")
Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/extent_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index dfda8f6da194..3a8681566fc5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -972,7 +972,7 @@ static void btrfs_readahead_expand(struct readahead_control *ractl,
 {
 	const u64 ra_pos = readahead_pos(ractl);
 	const u64 ra_end = ra_pos + readahead_length(ractl);
-	const u64 em_end = em->start + em->ram_bytes;
+	const u64 em_end = em->start + em->len;

 	/* No expansion for holes and inline extents. */
 	if (em->disk_bytenr > EXTENT_MAP_LAST_BYTE)
-- 
2.50.1

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] btrfs: fix incorrect readahead expansion length
  2025-10-01 16:50 [PATCH] btrfs: fix incorrect readahead expansion length Boris Burkov
@ 2025-10-01 17:10 ` Filipe Manana
  2025-10-01 21:14   ` Boris Burkov
  2025-10-03 11:02 ` Max Chernoff
  1 sibling, 1 reply; 5+ messages in thread
From: Filipe Manana @ 2025-10-01 17:10 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Wed, Oct 1, 2025 at 5:51 PM Boris Burkov <boris@bur.io> wrote:
>
> The intent of btrfs_readahead_expand() was to expand to the length of
> the current compressed extent being read. However, "ram_bytes" is *not*
> that, in the case where a single physical compressed extent is used for
> multiple file extents.
>
> Consider this case with a large compressed extent C and then later two
> non-compressed extents N1 and N2 written over C, leaving C1 and C2
> pointing to offset/len pairs of C:
> [               C                 ]
> [ N1 ][     C1     ][ N2 ][   C2  ]
>
> In such a case, ram_bytes for both C1 and C2 is the full uncompressed
> length of C. So starting readahead in C1 will expand the readahead past
> the end of C1, past N2, and into C2. This will then expand readahead
> again, to C2_start + ram_bytes, way past EOF. First of all, this is
> totally undesirable, we don't want to read the whole file in arbitrary
> chunks of the large underlying extent if it happens to exist. Secondly,
> it results in zeroing the range past the end of C2 up to ram_bytes. This
> is particularly unpleasant with fs-verity as it can zero and set
> uptodate pages in the verity virtual space past EOF. This incorrect
> readahead behavior can lead to verity verification errors, if we iterate
> in a way that happens to do the wrong readahead.

So this misses being clear, explicit, about the worst problem:
buffered read corruption (even when not using verity).
In that case the readahead loaded data from C into the page cache
range for N2, so then later anyone doing a buffered read for N2's
range, will get data from C.

This should be easy to turn into a test case for fstests too.

With that changelog update:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

>
> Fix this by using em->len for readahead expansion, not em->ram_bytes,
> resulting in the expected behavior of stopping readahead at the extent
> boundary.
>
> Reported-by: Max Chernoff <git@maxchernoff.ca>
> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2399898
> Fixes: 9e9ff875e417 ("btrfs: use readahead_expand() on compressed extents")
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/extent_io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index dfda8f6da194..3a8681566fc5 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -972,7 +972,7 @@ static void btrfs_readahead_expand(struct readahead_control *ractl,
>  {
>         const u64 ra_pos = readahead_pos(ractl);
>         const u64 ra_end = ra_pos + readahead_length(ractl);
> -       const u64 em_end = em->start + em->ram_bytes;
> +       const u64 em_end = em->start + em->len;
>
>         /* No expansion for holes and inline extents. */
>         if (em->disk_bytenr > EXTENT_MAP_LAST_BYTE)
> --
> 2.50.1
>
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] btrfs: fix incorrect readahead expansion length
  2025-10-01 17:10 ` Filipe Manana
@ 2025-10-01 21:14   ` Boris Burkov
  2025-10-03 12:30     ` Filipe Manana
  0 siblings, 1 reply; 5+ messages in thread
From: Boris Burkov @ 2025-10-01 21:14 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, kernel-team

On Wed, Oct 01, 2025 at 06:10:21PM +0100, Filipe Manana wrote:
> On Wed, Oct 1, 2025 at 5:51 PM Boris Burkov <boris@bur.io> wrote:
> >
> > The intent of btrfs_readahead_expand() was to expand to the length of
> > the current compressed extent being read. However, "ram_bytes" is *not*
> > that, in the case where a single physical compressed extent is used for
> > multiple file extents.
> >
> > Consider this case with a large compressed extent C and then later two
> > non-compressed extents N1 and N2 written over C, leaving C1 and C2
> > pointing to offset/len pairs of C:
> > [               C                 ]
> > [ N1 ][     C1     ][ N2 ][   C2  ]
> >
> > In such a case, ram_bytes for both C1 and C2 is the full uncompressed
> > length of C. So starting readahead in C1 will expand the readahead past
> > the end of C1, past N2, and into C2. This will then expand readahead
> > again, to C2_start + ram_bytes, way past EOF. First of all, this is
> > totally undesirable, we don't want to read the whole file in arbitrary
> > chunks of the large underlying extent if it happens to exist. Secondly,
> > it results in zeroing the range past the end of C2 up to ram_bytes. This
> > is particularly unpleasant with fs-verity as it can zero and set
> > uptodate pages in the verity virtual space past EOF. This incorrect
> > readahead behavior can lead to verity verification errors, if we iterate
> > in a way that happens to do the wrong readahead.
> 
> So this misses being clear, explicit, about the worst problem:
> buffered read corruption (even when not using verity).
> In that case the readahead loaded data from C into the page cache
> range for N2, so then later anyone doing a buffered read for N2's
> range, will get data from C.

I believe you, but I actually don't see it myself yet. Can you help me
understand?

As I currently see it:

Changing the readahead window will change which folios we call
btrfs_do_readpage() on, but inside btrfs_do_readpage(), we still have
the same logic to force submissions on extent boundaries. Whether due to
holes/inline extents, changing compression types or mismatched em->start
and bio_ctrl->last_em_start for compressed extents.

I have prepared an fstest that is roughly:

        # write a big-ish compressed extent
        _scratch_mount "-o compress-force=zstd:3" >/dev/null 2>&1
        $XFS_IO_PROG -f -c "pwrite -S 0xab 0 65536" $SCRATCH_MNT/foo &>/dev/null

        # put a couple smaller normal extents in over it
        _scratch_unmount
        _scratch_mount "-o compress=none" >/dev/null 2>&1
        $XFS_IO_PROG -f -c "pwrite -S 0xcd 4096 4096" $SCRATCH_MNT/foo &>/dev/null
        $XFS_IO_PROG -f -c "pwrite -S 0xcd 32768 16384" $SCRATCH_MNT/foo &>/dev/null

        # do some verification
        fsverity enable $SCRATCH_MNT/foo
        _scratch_unmount
        _scratch_mount "-o compress=none" >/dev/null 2>&1
        # clean cache read of 1 byte from the compressed extent. File
        # extent size 4096, ram bytes size 64k
        dd if=$SCRATCH_MNT/foo bs=1 count=1 2>/dev/null | _hexdump
        # if the read of "C1" wrote into "N", then we should see it on
        # this read, right?
        dd if=$SCRATCH_MNT/foo bs=1 count=1 skip=4096 2>/dev/null | _hexdump

And it triggers the fsverity errors, but I am not able to make the read
into the uncompressed range see the compressed bytes, yet.

Maybe that will be a clue towards my misunderstanding...

Thanks for the review and help,
Boris

> 
> This should be easy to turn into a test case for fstests too.
> 
> With that changelog update:
> 
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> 
> >
> > Fix this by using em->len for readahead expansion, not em->ram_bytes,
> > resulting in the expected behavior of stopping readahead at the extent
> > boundary.
> >
> > Reported-by: Max Chernoff <git@maxchernoff.ca>
> > Link: https://bugzilla.redhat.com/show_bug.cgi?id=2399898
> > Fixes: 9e9ff875e417 ("btrfs: use readahead_expand() on compressed extents")
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/extent_io.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index dfda8f6da194..3a8681566fc5 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -972,7 +972,7 @@ static void btrfs_readahead_expand(struct readahead_control *ractl,
> >  {
> >         const u64 ra_pos = readahead_pos(ractl);
> >         const u64 ra_end = ra_pos + readahead_length(ractl);
> > -       const u64 em_end = em->start + em->ram_bytes;
> > +       const u64 em_end = em->start + em->len;
> >
> >         /* No expansion for holes and inline extents. */
> >         if (em->disk_bytenr > EXTENT_MAP_LAST_BYTE)
> > --
> > 2.50.1
> >
> >

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] btrfs: fix incorrect readahead expansion length
  2025-10-01 16:50 [PATCH] btrfs: fix incorrect readahead expansion length Boris Burkov
  2025-10-01 17:10 ` Filipe Manana
@ 2025-10-03 11:02 ` Max Chernoff
  1 sibling, 0 replies; 5+ messages in thread
From: Max Chernoff @ 2025-10-03 11:02 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

Hi Boris,

On Wed, 2025-10-01 at 09:50 -0700, Boris Burkov wrote:
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index dfda8f6da194..3a8681566fc5 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -972,7 +972,7 @@ static void btrfs_readahead_expand(struct readahead_control *ractl,
>  {
>  	const u64 ra_pos = readahead_pos(ractl);
>  	const u64 ra_end = ra_pos + readahead_length(ractl);
> -	const u64 em_end = em->start + em->ram_bytes;
> +	const u64 em_end = em->start + em->len;
>
>  	/* No expansion for holes and inline extents. */
>  	if (em->disk_bytenr > EXTENT_MAP_LAST_BYTE)

I've tested this patch on top of 6.16.8 with the minimal disk image, and
on top of 6.16.9 on the machine where I originally encountered the
errors, and can confirm that it fixed the problem in both cases.

Thanks again for all your help,
-- Max

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] btrfs: fix incorrect readahead expansion length
  2025-10-01 21:14   ` Boris Burkov
@ 2025-10-03 12:30     ` Filipe Manana
  0 siblings, 0 replies; 5+ messages in thread
From: Filipe Manana @ 2025-10-03 12:30 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Wed, Oct 1, 2025 at 10:14 PM Boris Burkov <boris@bur.io> wrote:
>
> On Wed, Oct 01, 2025 at 06:10:21PM +0100, Filipe Manana wrote:
> > On Wed, Oct 1, 2025 at 5:51 PM Boris Burkov <boris@bur.io> wrote:
> > >
> > > The intent of btrfs_readahead_expand() was to expand to the length of
> > > the current compressed extent being read. However, "ram_bytes" is *not*
> > > that, in the case where a single physical compressed extent is used for
> > > multiple file extents.
> > >
> > > Consider this case with a large compressed extent C and then later two
> > > non-compressed extents N1 and N2 written over C, leaving C1 and C2
> > > pointing to offset/len pairs of C:
> > > [               C                 ]
> > > [ N1 ][     C1     ][ N2 ][   C2  ]
> > >
> > > In such a case, ram_bytes for both C1 and C2 is the full uncompressed
> > > length of C. So starting readahead in C1 will expand the readahead past
> > > the end of C1, past N2, and into C2. This will then expand readahead
> > > again, to C2_start + ram_bytes, way past EOF. First of all, this is
> > > totally undesirable, we don't want to read the whole file in arbitrary
> > > chunks of the large underlying extent if it happens to exist. Secondly,
> > > it results in zeroing the range past the end of C2 up to ram_bytes. This
> > > is particularly unpleasant with fs-verity as it can zero and set
> > > uptodate pages in the verity virtual space past EOF. This incorrect
> > > readahead behavior can lead to verity verification errors, if we iterate
> > > in a way that happens to do the wrong readahead.
> >
> > So this misses being clear, explicit, about the worst problem:
> > buffered read corruption (even when not using verity).
> > In that case the readahead loaded data from C into the page cache
> > range for N2, so then later anyone doing a buffered read for N2's
> > range, will get data from C.
>
> I believe you, but I actually don't see it myself yet. Can you help me
> understand?
>
> As I currently see it:
>
> Changing the readahead window will change which folios we call
> btrfs_do_readpage() on, but inside btrfs_do_readpage(), we still have
> the same logic to force submissions on extent boundaries. Whether due to
> holes/inline extents, changing compression types or mismatched em->start
> and bio_ctrl->last_em_start for compressed extents.

Oh yes, the last_em_start thing, I forgot about it, though I
introduced it years ago because of multiple file extent items pointing
to parts of the same compressed extent leading to read corruption in
that kind of scenario I was mentioning.

So it's fine.
You can add my RB tag and push it to for-next so that we get this
merged into Linus' tree asap.

Thanks.

>
> I have prepared an fstest that is roughly:
>
>         # write a big-ish compressed extent
>         _scratch_mount "-o compress-force=zstd:3" >/dev/null 2>&1
>         $XFS_IO_PROG -f -c "pwrite -S 0xab 0 65536" $SCRATCH_MNT/foo &>/dev/null
>
>         # put a couple smaller normal extents in over it
>         _scratch_unmount
>         _scratch_mount "-o compress=none" >/dev/null 2>&1
>         $XFS_IO_PROG -f -c "pwrite -S 0xcd 4096 4096" $SCRATCH_MNT/foo &>/dev/null
>         $XFS_IO_PROG -f -c "pwrite -S 0xcd 32768 16384" $SCRATCH_MNT/foo &>/dev/null
>
>         # do some verification
>         fsverity enable $SCRATCH_MNT/foo
>         _scratch_unmount
>         _scratch_mount "-o compress=none" >/dev/null 2>&1
>         # clean cache read of 1 byte from the compressed extent. File
>         # extent size 4096, ram bytes size 64k
>         dd if=$SCRATCH_MNT/foo bs=1 count=1 2>/dev/null | _hexdump
>         # if the read of "C1" wrote into "N", then we should see it on
>         # this read, right?
>         dd if=$SCRATCH_MNT/foo bs=1 count=1 skip=4096 2>/dev/null | _hexdump
>
> And it triggers the fsverity errors, but I am not able to make the read
> into the uncompressed range see the compressed bytes, yet.
>
> Maybe that will be a clue towards my misunderstanding...
>
> Thanks for the review and help,
> Boris
>
> >
> > This should be easy to turn into a test case for fstests too.
> >
> > With that changelog update:
> >
> > Reviewed-by: Filipe Manana <fdmanana@suse.com>
> >
> > >
> > > Fix this by using em->len for readahead expansion, not em->ram_bytes,
> > > resulting in the expected behavior of stopping readahead at the extent
> > > boundary.
> > >
> > > Reported-by: Max Chernoff <git@maxchernoff.ca>
> > > Link: https://bugzilla.redhat.com/show_bug.cgi?id=2399898
> > > Fixes: 9e9ff875e417 ("btrfs: use readahead_expand() on compressed extents")
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > ---
> > >  fs/btrfs/extent_io.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > > index dfda8f6da194..3a8681566fc5 100644
> > > --- a/fs/btrfs/extent_io.c
> > > +++ b/fs/btrfs/extent_io.c
> > > @@ -972,7 +972,7 @@ static void btrfs_readahead_expand(struct readahead_control *ractl,
> > >  {
> > >         const u64 ra_pos = readahead_pos(ractl);
> > >         const u64 ra_end = ra_pos + readahead_length(ractl);
> > > -       const u64 em_end = em->start + em->ram_bytes;
> > > +       const u64 em_end = em->start + em->len;
> > >
> > >         /* No expansion for holes and inline extents. */
> > >         if (em->disk_bytenr > EXTENT_MAP_LAST_BYTE)
> > > --
> > > 2.50.1
> > >
> > >

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-10-03 12:31 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-01 16:50 [PATCH] btrfs: fix incorrect readahead expansion length Boris Burkov
2025-10-01 17:10 ` Filipe Manana
2025-10-01 21:14   ` Boris Burkov
2025-10-03 12:30     ` Filipe Manana
2025-10-03 11:02 ` Max Chernoff

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox