From: Boris Burkov <boris@bur.io>
To: Filipe Manana <fdmanana@kernel.org>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] btrfs: fix incorrect readahead expansion length
Date: Wed, 1 Oct 2025 14:14:10 -0700
Message-ID: <20251001211410.GA2927167@zen.localdomain>
In-Reply-To: <CAL3q7H4uPf0+dV=7-x4GyfqU2SxW1uzr5iT32aH10Pupa6r81g@mail.gmail.com>

On Wed, Oct 01, 2025 at 06:10:21PM +0100, Filipe Manana wrote:
> On Wed, Oct 1, 2025 at 5:51 PM Boris Burkov <boris@bur.io> wrote:
> >
> > The intent of btrfs_readahead_expand() was to expand to the length of
> > the current compressed extent being read. However, "ram_bytes" is *not*
> > that, in the case where a single physical compressed extent is used for
> > multiple file extents.
> >
> > Consider this case with a large compressed extent C and then later two
> > non-compressed extents N1 and N2 written over C, leaving C1 and C2
> > pointing to offset/len pairs of C:
> > [               C                 ]
> > [ N1 ][     C1     ][ N2 ][   C2  ]
> >
> > In such a case, ram_bytes for both C1 and C2 is the full uncompressed
> > length of C. So starting readahead in C1 will expand the readahead past
> > the end of C1, past N2, and into C2. This will then expand readahead
> > again, to C2_start + ram_bytes, way past EOF. First of all, this is
> > totally undesirable, we don't want to read the whole file in arbitrary
> > chunks of the large underlying extent if it happens to exist. Secondly,
> > it results in zeroing the range past the end of C2 up to ram_bytes. This
> > is particularly unpleasant with fs-verity as it can zero and set
> > uptodate pages in the verity virtual space past EOF. This incorrect
> > readahead behavior can lead to verity verification errors, if we iterate
> > in a way that happens to do the wrong readahead.
> 
> So this misses being clear, explicit, about the worst problem:
> buffered read corruption (even when not using verity).
> In that case the readahead loaded data from C into the page cache
> range for N2, so then later anyone doing a buffered read for N2's
> range, will get data from C.

I believe you, but I actually don't see it myself yet. Can you help me
understand?

As I currently see it:

Changing the readahead window will change which folios we call
btrfs_do_readpage() on, but inside btrfs_do_readpage(), we still have
the same logic to force submissions on extent boundaries, whether due to
holes/inline extents, changing compression types, or mismatched em->start
and bio_ctrl->last_em_start for compressed extents.

I have prepared an fstest that is roughly:

        # write a big-ish compressed extent
        _scratch_mount "-o compress-force=zstd:3" >/dev/null 2>&1
        $XFS_IO_PROG -f -c "pwrite -S 0xab 0 65536" $SCRATCH_MNT/foo &>/dev/null

        # put a couple smaller normal extents in over it
        _scratch_unmount
        _scratch_mount "-o compress=none" >/dev/null 2>&1
        $XFS_IO_PROG -f -c "pwrite -S 0xcd 4096 4096" $SCRATCH_MNT/foo &>/dev/null
        $XFS_IO_PROG -f -c "pwrite -S 0xcd 32768 16384" $SCRATCH_MNT/foo &>/dev/null

        # do some verification
        fsverity enable $SCRATCH_MNT/foo
        _scratch_unmount
        _scratch_mount "-o compress=none" >/dev/null 2>&1
        # clean cache read of 1 byte from the compressed extent. File
        # extent size 4096, ram bytes size 64k
        dd if=$SCRATCH_MNT/foo bs=1 count=1 2>/dev/null | _hexdump
        # if the read of "C1" wrote into "N", then we should see it on
        # this read, right?
        dd if=$SCRATCH_MNT/foo bs=1 count=1 skip=4096 2>/dev/null | _hexdump

And it triggers the fsverity errors, but I am not able to make the read
into the uncompressed range see the compressed bytes, yet.

Maybe that will be a clue towards my misunderstanding...

Thanks for the review and help,
Boris

> 
> This should be easy to turn into a test case for fstests too.
> 
> With that changelog update:
> 
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> 
> >
> > Fix this by using em->len for readahead expansion, not em->ram_bytes,
> > resulting in the expected behavior of stopping readahead at the extent
> > boundary.
> >
> > Reported-by: Max Chernoff <git@maxchernoff.ca>
> > Link: https://bugzilla.redhat.com/show_bug.cgi?id=2399898
> > Fixes: 9e9ff875e417 ("btrfs: use readahead_expand() on compressed extents")
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/extent_io.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index dfda8f6da194..3a8681566fc5 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -972,7 +972,7 @@ static void btrfs_readahead_expand(struct readahead_control *ractl,
> >  {
> >         const u64 ra_pos = readahead_pos(ractl);
> >         const u64 ra_end = ra_pos + readahead_length(ractl);
> > -       const u64 em_end = em->start + em->ram_bytes;
> > +       const u64 em_end = em->start + em->len;
> >
> >         /* No expansion for holes and inline extents. */
> >         if (em->disk_bytenr > EXTENT_MAP_LAST_BYTE)
> > --
> > 2.50.1
> >
> >
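
For reference, the effect of the one-line change quoted above can be
worked through in user space for the file layout that the test in this
message creates. This is only a sketch: struct em_sketch, em_end_old(),
em_end_new(), em_sketch_check() and test_layout are hypothetical names,
not kernel code. The extent boundaries assume that the 64K write lands
as a single compressed extent and that the two uncompressed overwrites
split it at the offsets given to pwrite.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the three extent map fields used by the patch. */
struct em_sketch {
	uint64_t start;     /* file offset where this file extent begins */
	uint64_t len;       /* length of this file extent within the file */
	uint64_t ram_bytes; /* uncompressed length of the whole underlying extent */
};

/* Expansion target before the patch: em->start + em->ram_bytes. */
static inline uint64_t em_end_old(const struct em_sketch *em)
{
	return em->start + em->ram_bytes;
}

/* Expansion target after the patch: em->start + em->len. */
static inline uint64_t em_end_new(const struct em_sketch *em)
{
	return em->start + em->len;
}

/* Sanity check for the sketch: a piece is never longer than the extent it references. */
static inline void em_sketch_check(const struct em_sketch *em)
{
	assert(em->len <= em->ram_bytes);
}

/*
 * Assumed layout after the test's writes (file size 65536):
 *   [0, 4096)       compressed piece
 *   [4096, 8192)    uncompressed overwrite
 *   [8192, 32768)   compressed piece
 *   [32768, 49152)  uncompressed overwrite
 *   [49152, 65536)  compressed piece
 * Only the compressed pieces are listed; each keeps ram_bytes = 65536.
 */
const struct em_sketch test_layout[3] = {
	{ .start = 0,     .len = 4096,  .ram_bytes = 65536 },
	{ .start = 8192,  .len = 24576, .ram_bytes = 65536 },
	{ .start = 49152, .len = 16384, .ram_bytes = 65536 },
};
```

Under those assumptions, em_end_old() gives 65536, 73728 and 114688 for
the three compressed pieces. The first already spans both uncompressed
ranges, and the last two are past the 65536-byte EOF. em_end_new() gives
4096, 32768 and 65536, each stopping at the end of its own file extent.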
