linux-btrfs.vger.kernel.org archive mirror
* [PATCH v2] btrfs: do more strict compressed read merge check
@ 2025-08-23 23:24 Qu Wenruo
  2025-08-25 10:42 ` Filipe Manana
  0 siblings, 1 reply; 3+ messages in thread
From: Qu Wenruo @ 2025-08-23 23:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: stable

[BUG]
When running test case generic/457 with 64K page size, 4K btrfs block
size and the "compress=zstd" mount option, there is a chance to hit the
following error:

FSTYP         -- btrfs
PLATFORM      -- Linux/aarch64 btrfs-aarch64 6.17.0-rc2-custom+ #129 SMP PREEMPT_DYNAMIC Wed Aug 20 18:52:51 ACST 2025
MKFS_OPTIONS  -- -s 4k /dev/mapper/test-scratch1
MOUNT_OPTIONS -- -o compress=zstd /dev/mapper/test-scratch1 /mnt/scratch

generic/457 2s ... [failed, exit status 1]- output mismatch (see /home/adam/xfstests-dev/results//generic/457.out.bad)
    --- tests/generic/457.out	2024-04-25 18:13:45.160550980 +0930
    +++ /home/adam/xfstests-dev/results//generic/457.out.bad	2025-08-22 16:09:41.039352391 +0930
    @@ -1,2 +1,3 @@
     QA output created by 457
    -Silence is golden
    +testfile6 end md5sum mismatched
    +(see /home/adam/xfstests-dev/results//generic/457.full for details)
    ...
    (Run 'diff -u /home/adam/xfstests-dev/tests/generic/457.out /home/adam/xfstests-dev/results//generic/457.out.bad'  to see the entire diff)

The root problem is that, after certain fsx operations, the file
contents change just after a mount cycle.

There is a much smaller reproducer based on that test case, which I
mainly used to debug the bug:

workload() {
	mkfs.btrfs -f $dev > /dev/null
	dmesg -C
	trace-cmd clear
	mount -o compress=zstd $dev $mnt
	xfs_io -f -c "pwrite -S 0xff 0 256K" -c "sync" $mnt/base > /dev/null
	cp --reflink=always -p -f $mnt/base $mnt/file
	$fsx -N 4 -d -k -S 3746842 $mnt/file
	if [ $? -ne 0 ]; then
		echo "!!! FSX FAILURE !!!"
		fail
	fi
	csum_before=$(_md5_checksum $mnt/file)
	stop_trace
	umount $mnt
	mount $dev $mnt
	csum_after=$(_md5_checksum $mnt/file)
	umount $mnt
	if [ "$csum_before" != "$csum_after" ]; then
		echo "!!! CSUM MISMATCH !!!"
		fail
	fi
}

This seed value will cause 100% reproducible csum mismatch after a mount
cycle.

The seed value results in only 2 real operations:

  Seed set to 3746842
  main: filesystem does not support fallocate mode FALLOC_FL_UNSHARE_RANGE, disabling!
  main: filesystem does not support fallocate mode FALLOC_FL_COLLAPSE_RANGE, disabling!
  main: filesystem does not support fallocate mode FALLOC_FL_INSERT_RANGE, disabling!
  main: filesystem does not support exchange range, disabling!
  main: filesystem does not support dontcache IO, disabling!
  2 clone	from 0x3b000 to 0x3f000, (0x4000 bytes) at 0x1f000
  3 write	0x2975b thru	0x2ba20	(0x22c6 bytes)	dontcache=0
  All 4 operations completed A-OK!

[CAUSE]
With extra debug trace_printk(), the following sequence can explain the
root cause:

  fsx-3900290 [002] ..... 161696.160966: btrfs_submit_compressed_read: r/i=5/258 file_off=131072 em start=126976 len=16384

The "r/i" is showing the root id and the ino number.
In this case, my minimal reproducer is indeed using inode 258 of
subvolume 5, and that's the inode with changing contents.

The above trace is from the function btrfs_submit_compressed_read(),
triggered by fsx to read the folio at file offset 128K.

Notice that the extent map is at file offset 124K, with a length of 16K.
This means the extent map only covers the first 12K (3 blocks) of the
folio at 128K.
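
Spelled out with the numbers from the trace above:

  extent map : [126976, 143360) = [124K, 140K), 16K long
  folio      : [131072, 196608) = [128K, 192K), one 64K page
  overlap    : [131072, 143360) = [128K, 140K) = 12K = 3 x 4K blocks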

  fsx-3900290 [002] ..... 161696.160969: trace_dump_cb: btrfs_submit_compressed_read, r/i=5/258 file off start=131072 len=65536 bi_size=65536

This is the line I used to dump the basic info of a bbio, which shows
that the bi_size is 64K, i.e. covering the whole 64K folio at file
offset 128K.

But remember, the extent map only covers 3 blocks, definitely not enough
to cover the whole 64K folio at 128K file offset.

  kworker/u19:1-3748349 [002] ..... 161696.161154: btrfs_decompress_buf2page: r/i=5/258 file_off=131072 copy_len=4096 content=ffff
  kworker/u19:1-3748349 [002] ..... 161696.161155: btrfs_decompress_buf2page: r/i=5/258 file_off=135168 copy_len=4096 content=ffff
  kworker/u19:1-3748349 [002] ..... 161696.161156: btrfs_decompress_buf2page: r/i=5/258 file_off=139264 copy_len=4096 content=ffff
  kworker/u19:1-3748349 [002] ..... 161696.161157: btrfs_decompress_buf2page: r/i=5/258 file_off=143360 copy_len=4096 content=ffff

The above lines show that btrfs_decompress_buf2page(), called by the
zstd decompression code, is copying the decompressed content into the
filemap.

But notice that the last line is already beyond the extent map range.

Furthermore, there is no more compressed content copied, as the
compressed bio only has the extent map covering the first 3 blocks (the
4th block copy is already incorrect).

   kworker/u19:1-3748349 [002] ..... 161696.161161: trace_dump_cb: r/i=5/258 file_pos=131072 content=ffff
   kworker/u19:1-3748349 [002] ..... 161696.161161: trace_dump_cb: r/i=5/258 file_pos=135168 content=ffff
   kworker/u19:1-3748349 [002] ..... 161696.161162: trace_dump_cb: r/i=5/258 file_pos=139264 content=ffff
   kworker/u19:1-3748349 [002] ..... 161696.161162: trace_dump_cb: r/i=5/258 file_pos=143360 content=ffff
   kworker/u19:1-3748349 [002] ..... 161696.161162: trace_dump_cb: r/i=5/258 file_pos=147456 content=0000

This is the extra dump of the compressed bio. After file offset 140K
(143360) the content is all zero, which is incorrect.
The zeros are there because we didn't copy anything into the folio.

The root cause of the corruption is that we are submitting a compressed
read for a whole folio, but the extent map we got only covers the first
3 blocks, meaning the compressed read path is merging reads that
shouldn't be merged.

The involved file extents are:

        item 19 key (258 EXTENT_DATA 126976) itemoff 15143 itemsize 53
                generation 9 type 1 (regular)
                extent data disk byte 13635584 nr 4096
                extent data offset 110592 nr 16384 ram 131072
                extent compression 3 (zstd)
        item 20 key (258 EXTENT_DATA 143360) itemoff 15090 itemsize 53
                generation 9 type 1 (regular)
                extent data disk byte 13635584 nr 4096
                extent data offset 12288 nr 24576 ram 131072
                extent compression 3 (zstd)

Note that both extents at 124K and 140K point to the same compressed
extent, but with different offsets.
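
Decoded from the dump, both file extents reference the same 128K
(ram 131072) of decompressed data, just at different offsets into it:

  extent at 124K: disk byte 13635584, decompressed bytes [110592, 126976) = [108K, 124K)
  extent at 140K: disk byte 13635584, decompressed bytes [ 12288,  36864) = [ 12K,  36K)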

This means, we reads of range [124K, 140K) and [140K, 164K) should not
be merged.

But the read merge check function, btrfs_bio_is_contig(), only checks
the disk_bytenr of two compressed reads, as there is not enough
information (like the involved extent maps) to do more comprehensive
checks, resulting in the incorrect compressed read.
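
For reference, the compressed branch of that check boils down to
roughly the following (a simplified sketch with a made-up helper name,
not the upstream code): only the disk start of the compressed extent is
compared, so the two file extents above (both at disk byte 13635584)
pass the check even though they decompress different ranges.

  /* Simplified sketch of the compressed read merge condition. */
  static bool compressed_read_is_contig(const struct bio *bio, u64 disk_bytenr)
  {
  	/* Only the disk start of the compressed extent is compared. */
  	return bio->bi_iter.bi_sector == (disk_bytenr >> SECTOR_SHIFT);
  }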

Unfortunately this is a long-existing bug, predating subpage block size
support.

But subpage block size support (and experimental large folio support)
makes it much easier to detect.

If the block size equals the page size, a regular page read will only
read one block at a time, thus there is no extent map sharing nor
merging.

(This means that for bs == ps cases it's still possible to hit the bug
with readahead, we just don't have test coverage with content
verification for readahead.)

[FIX]
Save the last hit compressed extent map start/len into btrfs_bio_ctrl,
and check if the current extent map is the same as the saved one.

Here we only save em::start/len to save memory for btrfs_bio_ctrl, as
it uses stack memory, which is a very limited resource inside the
kernel.

Since the compressed extent maps are never merged, their start/len are
unique inside the same inode, thus just checking start/len will be
enough to make sure they are the same extent map.

If the extent maps do not match, force submitting the current bio, so
that the read will never be merged.
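
Using the extents from the dump above as an example (values from this
report, just to illustrate the intended behavior):

  block at 128K: em (start=126976, len=16384) -> saved as the last compressed em
  block at 140K: em (start=143360, len=24576) -> start/len mismatch -> force_bio_submit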

CC: stable@vger.kernel.org
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
v2:
- Only save extent_map::start/len to save memory for btrfs_bio_ctrl
  It's using on-stack memory which is very limited inside the kernel.

- Remove the commit message mention of clearing the last saved em
  Since we're using em::start/len, there is no need to clear them.
  Either we hit the same em::start/len, meaning hitting the same extent
  map, or we hit a different em, which will have a different start/len.
---
 fs/btrfs/extent_io.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0c12fd64a1f3..418e3bf40f94 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -131,6 +131,22 @@ struct btrfs_bio_ctrl {
 	 */
 	unsigned long submit_bitmap;
 	struct readahead_control *ractl;
+
+	/*
+	 * The start/len of the last hit compressed extent map.
+	 *
+	 * The current btrfs_bio_is_contig() only uses disk_bytenr as
+	 * the condition to check if the read can be merged with previous
+	 * bio, which is not correct. E.g. two file extents pointing to the
+	 * same extent.
+	 *
+	 * So here we need to do extra check to merge reads that are
+	 * covered by the same extent map.
+	 * Just extent_map::start/len will be enough, as they are unique
+	 * inside the same inode.
+	 */
+	u64 last_compress_em_start;
+	u64 last_compress_em_len;
 };
 
 /*
@@ -957,6 +973,32 @@ static void btrfs_readahead_expand(struct readahead_control *ractl,
 		readahead_expand(ractl, ra_pos, em_end - ra_pos);
 }
 
+static void save_compressed_em(struct btrfs_bio_ctrl *bio_ctrl,
+			       const struct extent_map *em)
+{
+	if (btrfs_extent_map_compression(em) == BTRFS_COMPRESS_NONE)
+		return;
+	bio_ctrl->last_compress_em_start = em->start;
+	bio_ctrl->last_compress_em_len = em->len;
+}
+
+static bool is_same_compressed_em(struct btrfs_bio_ctrl *bio_ctrl,
+				  const struct extent_map *em)
+{
+	/*
+	 * Only if the em is completely the same as the previous one we can merge
+	 * the current folio in the read bio.
+	 *
+	 * Here we only need to compare the em->start/len against saved
+	 * last_compress_em_start/len, as start/len inside an inode are unique,
+	 * and compressed extent maps are never merged.
+	 */
+	if (em->start != bio_ctrl->last_compress_em_start ||
+	    em->len != bio_ctrl->last_compress_em_len)
+		return false;
+	return true;
+}
+
 /*
  * basic readpage implementation.  Locked extent state structs are inserted
  * into the tree that are removed when the IO is done (by the end_io
@@ -1080,9 +1122,19 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 		    *prev_em_start != em->start)
 			force_bio_submit = true;
 
+		/*
+		 * We must ensure we only merge compressed read when the current
+		 * extent map matches the previous one exactly.
+		 */
+		if (compress_type != BTRFS_COMPRESS_NONE) {
+			if (!is_same_compressed_em(bio_ctrl, em))
+				force_bio_submit = true;
+		}
+
 		if (prev_em_start)
 			*prev_em_start = em->start;
 
+		save_compressed_em(bio_ctrl, em);
 		em_gen = em->generation;
 		btrfs_free_extent_map(em);
 		em = NULL;
-- 
2.50.1



* Re: [PATCH v2] btrfs: do more strict compressed read merge check
  2025-08-23 23:24 [PATCH v2] btrfs: do more strict compressed read merge check Qu Wenruo
@ 2025-08-25 10:42 ` Filipe Manana
  2025-08-25 22:22   ` Qu Wenruo
  0 siblings, 1 reply; 3+ messages in thread
From: Filipe Manana @ 2025-08-25 10:42 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, stable

On Sun, Aug 24, 2025 at 12:27 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> When running test case generic/457, there is a chance to hit the
> following error, with 64K page size and 4K btrfs block size, and
> "compress=zstd" mount option:
>
> FSTYP         -- btrfs
> PLATFORM      -- Linux/aarch64 btrfs-aarch64 6.17.0-rc2-custom+ #129 SMP PREEMPT_DYNAMIC Wed Aug 20 18:52:51 ACST 2025
> MKFS_OPTIONS  -- -s 4k /dev/mapper/test-scratch1
> MOUNT_OPTIONS -- -o compress=zstd /dev/mapper/test-scratch1 /mnt/scratch
>
> generic/457 2s ... [failed, exit status 1]- output mismatch (see /home/adam/xfstests-dev/results//generic/457.out.bad)
>     --- tests/generic/457.out   2024-04-25 18:13:45.160550980 +0930
>     +++ /home/adam/xfstests-dev/results//generic/457.out.bad    2025-08-22 16:09:41.039352391 +0930
>     @@ -1,2 +1,3 @@
>      QA output created by 457
>     -Silence is golden
>     +testfile6 end md5sum mismatched
>     +(see /home/adam/xfstests-dev/results//generic/457.full for details)
>     ...
>     (Run 'diff -u /home/adam/xfstests-dev/tests/generic/457.out /home/adam/xfstests-dev/results//generic/457.out.bad'  to see the entire diff)
>
> The root problem is, after certain fsx operations the file contents
> change just after a mount cycle.
>
> There is a much smaller reproducer based on that test case, which I
> mainly used to debug the bug:
>
> workload() {
>         mkfs.btrfs -f $dev > /dev/null
>         dmesg -C
>         trace-cmd clear
>         mount -o compress=zstd $dev $mnt
>         xfs_io -f -c "pwrite -S 0xff 0 256K" -c "sync" $mnt/base > /dev/null
>         cp --reflink=always -p -f $mnt/base $mnt/file
>         $fsx -N 4 -d -k -S 3746842 $mnt/file
>         if [ $? -ne 0 ]; then
>                 echo "!!! FSX FAILURE !!!"
>                 fail
>         fi
>         csum_before=$(_md5_checksum $mnt/file)
>         stop_trace
>         umount $mnt
>         mount $dev $mnt
>         csum_after=$(_md5_checksum $mnt/file)
>         umount $mnt
>         if [ "$csum_before" != "$csum_after" ]; then
>                 echo "!!! CSUM MISMATCH !!!"
>                 fail
>         fi
> }
>
> This seed value will cause 100% reproducible csum mismatch after a mount
> cycle.
>
> The seed value results only 2 real operations:
>
>   Seed set to 3746842
>   main: filesystem does not support fallocate mode FALLOC_FL_UNSHARE_RANGE, disabling!
>   main: filesystem does not support fallocate mode FALLOC_FL_COLLAPSE_RANGE, disabling!
>   main: filesystem does not support fallocate mode FALLOC_FL_INSERT_RANGE, disabling!
>   main: filesystem does not support exchange range, disabling!
>   main: filesystem does not support dontcache IO, disabling!
>   2 clone       from 0x3b000 to 0x3f000, (0x4000 bytes) at 0x1f000
>   3 write       0x2975b thru    0x2ba20 (0x22c6 bytes)  dontcache=0
>   All 4 operations completed A-OK!

Can you please make a test case for fstests?
Without fsx of course, since when new operations are added or fsx is
changed in other ways, the same seed is likely to not exercise the bug
anymore.

Just using xfs_io (writes, cloning, etc).

>
> [CAUSE]
> With extra debug trace_printk(), the following sequence can explain the
> root cause:
>
>   fsx-3900290 [002] ..... 161696.160966: btrfs_submit_compressed_read: r/i=5/258 file_off=131072 em start=126976 len=16384
>
> The "r/i" is showing the root id and the ino number.
> In this case, my minimal reproducer is indeed using inode 258 of
> subvolume 5, and that's the inode with changing contents.
>
> The above trace is from the function btrfs_submit_compressed_read(),
> triggered by fsx to read the folio at file offset 128K.
>
> Notice that the extent map, it's at offset 124K, with a length of 16K.
> This means the extent map only covers the first 12K (3 blocks) of the
> folio 128K.
>
>   fsx-3900290 [002] ..... 161696.160969: trace_dump_cb: btrfs_submit_compressed_read, r/i=5/258 file off start=131072 len=65536 bi_size=65536
>
> This is the line I used to dump the basic info of a bbio, which shows the
> bi_size is 64K, aka covering the whole 64K folio at file offset 128K.
>
> But remember, the extent map only covers 3 blocks, definitely not enough
> to cover the whole 64K folio at 128K file offset.
>
>   kworker/u19:1-3748349 [002] ..... 161696.161154: btrfs_decompress_buf2page: r/i=5/258 file_off=131072 copy_len=4096 content=ffff
>   kworker/u19:1-3748349 [002] ..... 161696.161155: btrfs_decompress_buf2page: r/i=5/258 file_off=135168 copy_len=4096 content=ffff
>   kworker/u19:1-3748349 [002] ..... 161696.161156: btrfs_decompress_buf2page: r/i=5/258 file_off=139264 copy_len=4096 content=ffff
>   kworker/u19:1-3748349 [002] ..... 161696.161157: btrfs_decompress_buf2page: r/i=5/258 file_off=143360 copy_len=4096 content=ffff
>
> The above lines show that btrfs_decompress_buf2page() called by zstd
> decompress code is copying the decompressed content into the filemap.
>
> But notice that, the last line is already beyond the extent map range.
>
> Furthermore, there are no more compressed content copy, as the
> compressed bio only has the extent map to cover the first 3 blocks (the
> 4th block copy is already incorrect).
>
>    kworker/u19:1-3748349 [002] ..... 161696.161161: trace_dump_cb: r/i=5/258 file_pos=131072 content=ffff
>    kworker/u19:1-3748349 [002] ..... 161696.161161: trace_dump_cb: r/i=5/258 file_pos=135168 content=ffff
>    kworker/u19:1-3748349 [002] ..... 161696.161162: trace_dump_cb: r/i=5/258 file_pos=139264 content=ffff
>    kworker/u19:1-3748349 [002] ..... 161696.161162: trace_dump_cb: r/i=5/258 file_pos=143360 content=ffff
>    kworker/u19:1-3748349 [002] ..... 161696.161162: trace_dump_cb: r/i=5/258 file_pos=147456 content=0000
>
> This is the extra dumpping of the compressed bio, after file offset
> 140K (143360), the content is all zero, which is incorrect.
> The zero is there because we didn't copy anything into the folio.
>
> The root cause of the corruption is, we are submitting a compressed read
> for a whole folio, but the extent map we get only covers the first 3
> blocks, meaning the compressed read path is merging reads that shouldn't
> be merged.
>
> The involved file extents are:
>
>         item 19 key (258 EXTENT_DATA 126976) itemoff 15143 itemsize 53
>                 generation 9 type 1 (regular)
>                 extent data disk byte 13635584 nr 4096
>                 extent data offset 110592 nr 16384 ram 131072
>                 extent compression 3 (zstd)
>         item 20 key (258 EXTENT_DATA 143360) itemoff 15090 itemsize 53
>                 generation 9 type 1 (regular)
>                 extent data disk byte 13635584 nr 4096
>                 extent data offset 12288 nr 24576 ram 131072
>                 extent compression 3 (zstd)
>
> Note that, both extents at 124K and 140K are pointing to the same
> compressed extent, but with different offset.
>
> This means, we reads of range [124K, 140K) and [140K, 165K) should not

"we reads of range..." -> the reads of ranges...

> be merged.
>
> But read merge check function, btrfs_bio_is_contig(), is only checking
> the disk_bytenr of two compressed reads, as there are not enough info
> like the involved extent maps to do more comprehensive checks, resulting
> the incorrect compressed read.

So this is a variant of similar problems solved in the past where for
compressed extents we merged when we shouldn't have:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=005efedf2c7d0a270ffbe28d8997b03844f3e3e7
http://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=808f80b46790f27e145c72112189d6a3be2bc884
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8e928218780e2f1cf2f5891c7575e8f0b284fcce

And we have test cases btrfs/103, btrfs/106 and btrfs/183 that
exercise those past bugs.

>
> Unfortunately this is a long existing bug, way before subpage block size
> support.
>
> But subpage block size support (and experimental large folio support)
> makes it much easier to detect.
>
> If block size equals page size, regular page read will only read one
> block each time, thus no extent map sharing nor merge.
>
> (This means for bs == ps cases, it's still possible to hit the bug with
> readahead, just we don't have test coverage with content verification
> for readahead)
>
> [FIX]
> Save the last hit compressed extent map start/len into btrfs_bio_ctrl,
> and check if the current extent map is the same as the saved one.
>
> Here we only save em::start/len to save memory for btrfs_bio_ctrl, as
> it's using the stack memory, which is a very limited resource inside the
> kernel.
>
> Since the compressed extent maps are never merged, their start/len are
> unique inside the same inode, thus just checking start/len will be
> enough to make sure they are the same extent map.
>
> If the extent maps do not match, force submitting the current bio, so
> that the read will never be merged.
>
> CC: stable@vger.kernel.org
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> v2:
> - Only save extent_map::start/len to save memory for btrfs_bio_ctrl
>   It's using on-stack memory which is very limited inside the kernel.
>
> - Remove the commit message mentioning of clearing last saved em
>   Since we're using em::start/len, there is no need to clear them.
>   Either we hit the same em::start/len, meaning hitting the same extent
>   map, or we hit a different em, which will have a different start/len.
> ---
>  fs/btrfs/extent_io.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 52 insertions(+)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 0c12fd64a1f3..418e3bf40f94 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -131,6 +131,22 @@ struct btrfs_bio_ctrl {
>          */
>         unsigned long submit_bitmap;
>         struct readahead_control *ractl;
> +
> +       /*
> +        * The start/len of the last hit compressed extent map.
> +        *
> +        * The current btrfs_bio_is_contig() only uses disk_bytenr as
> +        * the condition to check if the read can be merged with previous
> +        * bio, which is not correct. E.g. two file extents pointing to the
> +        * same extent.
> +        *
> +        * So here we need to do extra check to merge reads that are
> +        * covered by the same extent map.
> +        * Just extent_map::start/len will be enough, as they are unique
> +        * inside the same inode.
> +        */
> +       u64 last_compress_em_start;
> +       u64 last_compress_em_len;
>  };
>
>  /*
> @@ -957,6 +973,32 @@ static void btrfs_readahead_expand(struct readahead_control *ractl,
>                 readahead_expand(ractl, ra_pos, em_end - ra_pos);
>  }
>
> +static void save_compressed_em(struct btrfs_bio_ctrl *bio_ctrl,
> +                              const struct extent_map *em)
> +{
> +       if (btrfs_extent_map_compression(em) == BTRFS_COMPRESS_NONE)
> +               return;
> +       bio_ctrl->last_compress_em_start = em->start;
> +       bio_ctrl->last_compress_em_len = em->len;
> +}

Something so simple can be open coded instead in the single caller...

> +
> +static bool is_same_compressed_em(struct btrfs_bio_ctrl *bio_ctrl,
> +                                 const struct extent_map *em)
> +{
> +       /*
> +        * Only if the em is completely the same as the previous one we can merge
> +        * the current folio in the read bio.
> +        *
> +        * Here we only need to compare the em->start/len against saved
> +        * last_compress_em_start/len, as start/len inside an inode are unique,
> +        * and compressed extent maps are never merged.
> +        */
> +       if (em->start != bio_ctrl->last_compress_em_start ||
> +           em->len != bio_ctrl->last_compress_em_len)
> +               return false;

Same here. In fact we already have part of this logic in the caller, see below.


> +       return true;
> +}
> +
>  /*
>   * basic readpage implementation.  Locked extent state structs are inserted
>   * into the tree that are removed when the IO is done (by the end_io
> @@ -1080,9 +1122,19 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
>                     *prev_em_start != em->start)
>                         force_bio_submit = true;
>
> +               /*
> +                * We must ensure we only merge compressed read when the current
> +                * extent map matches the previous one exactly.
> +                */
> +               if (compress_type != BTRFS_COMPRESS_NONE) {
> +                       if (!is_same_compressed_em(bio_ctrl, em))
> +                               force_bio_submit = true;
> +               }

We already do almost all of this above - we only miss the extent map
length check.
We have the prev_em_start which already stores the start offset of the
last found compressed extent map, so we're duplicating code and logic
here unnecessarily.

Further, with this new logic, since last_compress_em_start and
last_compress_em_len are initialized to zero, we always do an
unnecessary split for the first readahead call that spans more than one
page/folio.
We need to distinguish the first call and ignore it - that's why
*prev_em_start is initialized to (u64)-1 and we check that special
value above.



> +
>                 if (prev_em_start)
>                         *prev_em_start = em->start;
>
> +               save_compressed_em(bio_ctrl, em);

This could have been placed under the check for compress_type !=
BTRFS_COMPRESS_NONE...

In other words, this could be simplified to this:

(also at https://pastebin.com/raw/tZHq7icd)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0c12fd64a1f3..a982277f852b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -131,6 +131,22 @@ struct btrfs_bio_ctrl {
 	 */
 	unsigned long submit_bitmap;
 	struct readahead_control *ractl;
+
+	/*
+	 * The start/len of the last hit compressed extent map.
+	 *
+	 * The current btrfs_bio_is_contig() only uses disk_bytenr as
+	 * the condition to check if the read can be merged with previous
+	 * bio, which is not correct. E.g. two file extents pointing to the
+	 * same extent.
+	 *
+	 * So here we need to do extra check to merge reads that are
+	 * covered by the same extent map.
+	 * Just extent_map::start/len will be enough, as they are unique
+	 * inside the same inode.
+	 */
+	u64 last_compress_em_start;
+	u64 last_compress_em_len;
 };
 
 /*
@@ -965,7 +981,7 @@ static void btrfs_readahead_expand(struct readahead_control *ractl,
  * return 0 on success, otherwise return error
  */
 static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
-			     struct btrfs_bio_ctrl *bio_ctrl, u64 *prev_em_start)
+			     struct btrfs_bio_ctrl *bio_ctrl)
 {
 	struct inode *inode = folio->mapping->host;
 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
@@ -1075,13 +1091,15 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 		 * is a corner case so we prioritize correctness over
 		 * non-optimal behavior (submitting 2 bios for the same extent).
 		 */
-		if (compress_type != BTRFS_COMPRESS_NONE &&
-		    prev_em_start && *prev_em_start != (u64)-1 &&
-		    *prev_em_start != em->start)
-			force_bio_submit = true;
-
-		if (prev_em_start)
-			*prev_em_start = em->start;
+		if (compress_type != BTRFS_COMPRESS_NONE) {
+			if (bio_ctrl->last_compress_em_start != U64_MAX &&
+			    (em->start != bio_ctrl->last_compress_em_start ||
+			     em->len != bio_ctrl->last_compress_em_len))
+				force_bio_submit = true;
+
+			bio_ctrl->last_compress_em_start = em->start;
+			bio_ctrl->last_compress_em_len = em->len;
+		}
 
 		em_gen = em->generation;
 		btrfs_free_extent_map(em);
@@ -1296,12 +1314,15 @@ int btrfs_read_folio(struct file *file, struct folio *folio)
 	const u64 start = folio_pos(folio);
 	const u64 end = start + folio_size(folio) - 1;
 	struct extent_state *cached_state = NULL;
-	struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
+	struct btrfs_bio_ctrl bio_ctrl = {
+		.opf = REQ_OP_READ,
+		.last_compress_em_start = U64_MAX,
+	};
 	struct extent_map *em_cached = NULL;
 	int ret;
 
 	lock_extents_for_read(inode, start, end, &cached_state);
-	ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
+	ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
 	btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);
 
 	btrfs_free_extent_map(em_cached);
@@ -2641,7 +2662,8 @@ void btrfs_readahead(struct readahead_control *rac)
 {
 	struct btrfs_bio_ctrl bio_ctrl = {
 		.opf = REQ_OP_READ | REQ_RAHEAD,
-		.ractl = rac
+		.ractl = rac,
+		.last_compress_em_start = U64_MAX,
 	};
 	struct folio *folio;
 	struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
@@ -2649,12 +2671,11 @@ void btrfs_readahead(struct readahead_control *rac)
 	const u64 end = start + readahead_length(rac) - 1;
 	struct extent_state *cached_state = NULL;
 	struct extent_map *em_cached = NULL;
-	u64 prev_em_start = (u64)-1;
 
 	lock_extents_for_read(inode, start, end, &cached_state);
 
 	while ((folio = readahead_folio(rac)) != NULL)
-		btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
+		btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
 
 	btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);

Thanks.
>                 em_gen = em->generation;
>                 btrfs_free_extent_map(em);
>                 em = NULL;
> --
> 2.50.1
>
>


* Re: [PATCH v2] btrfs: do more strict compressed read merge check
  2025-08-25 10:42 ` Filipe Manana
@ 2025-08-25 22:22   ` Qu Wenruo
  0 siblings, 0 replies; 3+ messages in thread
From: Qu Wenruo @ 2025-08-25 22:22 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, stable



在 2025/8/25 20:12, Filipe Manana 写道:
> On Sun, Aug 24, 2025 at 12:27 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BUG]
>> When running test case generic/457, there is a chance to hit the
>> following error, with 64K page size and 4K btrfs block size, and
>> "compress=zstd" mount option:
>>
>> FSTYP         -- btrfs
>> PLATFORM      -- Linux/aarch64 btrfs-aarch64 6.17.0-rc2-custom+ #129 SMP PREEMPT_DYNAMIC Wed Aug 20 18:52:51 ACST 2025
>> MKFS_OPTIONS  -- -s 4k /dev/mapper/test-scratch1
>> MOUNT_OPTIONS -- -o compress=zstd /dev/mapper/test-scratch1 /mnt/scratch
>>
>> generic/457 2s ... [failed, exit status 1]- output mismatch (see /home/adam/xfstests-dev/results//generic/457.out.bad)
>>      --- tests/generic/457.out   2024-04-25 18:13:45.160550980 +0930
>>      +++ /home/adam/xfstests-dev/results//generic/457.out.bad    2025-08-22 16:09:41.039352391 +0930
>>      @@ -1,2 +1,3 @@
>>       QA output created by 457
>>      -Silence is golden
>>      +testfile6 end md5sum mismatched
>>      +(see /home/adam/xfstests-dev/results//generic/457.full for details)
>>      ...
>>      (Run 'diff -u /home/adam/xfstests-dev/tests/generic/457.out /home/adam/xfstests-dev/results//generic/457.out.bad'  to see the entire diff)
>>
>> The root problem is, after certain fsx operations the file contents
>> change just after a mount cycle.
>>
>> There is a much smaller reproducer based on that test case, which I
>> mainly used to debug the bug:
>>
>> workload() {
>>          mkfs.btrfs -f $dev > /dev/null
>>          dmesg -C
>>          trace-cmd clear
>>          mount -o compress=zstd $dev $mnt
>>          xfs_io -f -c "pwrite -S 0xff 0 256K" -c "sync" $mnt/base > /dev/null
>>          cp --reflink=always -p -f $mnt/base $mnt/file
>>          $fsx -N 4 -d -k -S 3746842 $mnt/file
>>          if [ $? -ne 0 ]; then
>>                  echo "!!! FSX FAILURE !!!"
>>                  fail
>>          fi
>>          csum_before=$(_md5_checksum $mnt/file)
>>          stop_trace
>>          umount $mnt
>>          mount $dev $mnt
>>          csum_after=$(_md5_checksum $mnt/file)
>>          umount $mnt
>>          if [ "$csum_before" != "$csum_after" ]; then
>>                  echo "!!! CSUM MISMATCH !!!"
>>                  fail
>>          fi
>> }
>>
>> This seed value will cause 100% reproducible csum mismatch after a mount
>> cycle.
>>
>> The seed value results only 2 real operations:
>>
>>    Seed set to 3746842
>>    main: filesystem does not support fallocate mode FALLOC_FL_UNSHARE_RANGE, disabling!
>>    main: filesystem does not support fallocate mode FALLOC_FL_COLLAPSE_RANGE, disabling!
>>    main: filesystem does not support fallocate mode FALLOC_FL_INSERT_RANGE, disabling!
>>    main: filesystem does not support exchange range, disabling!
>>    main: filesystem does not support dontcache IO, disabling!
>>    2 clone       from 0x3b000 to 0x3f000, (0x4000 bytes) at 0x1f000
>>    3 write       0x2975b thru    0x2ba20 (0x22c6 bytes)  dontcache=0
>>    All 4 operations completed A-OK!
> 
> Can you please make a test case for fstests?
> Without fsx of course, since when new operations are added or fsx is
> changed in other ways, the same seed is likely to not exercise the bug
> anymore.
> 
> Just using xfs_io (writes, cloning, etc).

Sure.

Although by the nature of page reads, on x86_64 it will mostly not 
reproduce and only trigger on bs < ps cases.

[...]
>> This means, we reads of range [124K, 140K) and [140K, 165K) should not
> 
> "we reads of range..." -> the reads of ranges...
> 
>> be merged.
>>
>> But read merge check function, btrfs_bio_is_contig(), is only checking
>> the disk_bytenr of two compressed reads, as there are not enough info
>> like the involved extent maps to do more comprehensive checks, resulting
>> the incorrect compressed read.
> 
> So this is a variant of similar problems solved in the past where for
> compressed extents we merged when we shouldn't have:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=005efedf2c7d0a270ffbe28d8997b03844f3e3e7
> http://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=808f80b46790f27e145c72112189d6a3be2bc884
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8e928218780e2f1cf2f5891c7575e8f0b284fcce

Exactly.

> And we have test cases btrfs/103, btrfs/106 and btrfs/183 that
> exercise those past pugs.
> 
>>
>> Unfortunately this is a long existing bug, way before subpage block size
>> support.
>>
>> But subpage block size support (and experimental large folio support)
>> makes it much easier to detect.
>>
>> If block size equals page size, regular page read will only read one
>> block each time, thus no extent map sharing nor merge.
>>
>> (This means for bs == ps cases, it's still possible to hit the bug with
>> readahead, just we don't have test coverage with content verification
>> for readahead)
>>
>> [FIX]
>> Save the last hit compressed extent map start/len into btrfs_bio_ctrl,
>> and check if the current extent map is the same as the saved one.
>>
>> Here we only save em::start/len to save memory for btrfs_bio_ctrl, as
>> it's using the stack memory, which is a very limited resource inside the
>> kernel.
>>
>> Since the compressed extent maps are never merged, their start/len are
>> unique inside the same inode, thus just checking start/len will be
>> enough to make sure they are the same extent map.
>>
>> If the extent maps do not match, force submitting the current bio, so
>> that the read will never be merged.
>>
>> CC: stable@vger.kernel.org
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> v2:
>> - Only save extent_map::start/len to save memory for btrfs_bio_ctrl
>>    It's using on-stack memory which is very limited inside the kernel.
>>
>> - Remove the commit message mentioning of clearing last saved em
>>    Since we're using em::start/len, there is no need to clear them.
>>    Either we hit the same em::start/len, meaning hitting the same extent
>>    map, or we hit a different em, which will have a different start/len.
>> ---
>>   fs/btrfs/extent_io.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 52 insertions(+)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 0c12fd64a1f3..418e3bf40f94 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -131,6 +131,22 @@ struct btrfs_bio_ctrl {
>>           */
>>          unsigned long submit_bitmap;
>>          struct readahead_control *ractl;
>> +
>> +       /*
>> +        * The start/len of the last hit compressed extent map.
>> +        *
>> +        * The current btrfs_bio_is_contig() only uses disk_bytenr as
>> +        * the condition to check if the read can be merged with previous
>> +        * bio, which is not correct. E.g. two file extents pointing to the
>> +        * same extent.
>> +        *
>> +        * So here we need to do extra check to merge reads that are
>> +        * covered by the same extent map.
>> +        * Just extent_map::start/len will be enough, as they are unique
>> +        * inside the same inode.
>> +        */
>> +       u64 last_compress_em_start;
>> +       u64 last_compress_em_len;
>>   };
>>
>>   /*
>> @@ -957,6 +973,32 @@ static void btrfs_readahead_expand(struct readahead_control *ractl,
>>                  readahead_expand(ractl, ra_pos, em_end - ra_pos);
>>   }
>>
>> +static void save_compressed_em(struct btrfs_bio_ctrl *bio_ctrl,
>> +                              const struct extent_map *em)
>> +{
>> +       if (btrfs_extent_map_compression(em) == BTRFS_COMPRESS_NONE)
>> +               return;
>> +       bio_ctrl->last_compress_em_start = em->start;
>> +       bio_ctrl->last_compress_em_len = em->len;
>> +}
> 
> Something so simple can be open coded instead in the single caller...
> 
>> +
>> +static bool is_same_compressed_em(struct btrfs_bio_ctrl *bio_ctrl,
>> +                                 const struct extent_map *em)
>> +{
>> +       /*
>> +        * Only if the em is completely the same as the previous one we can merge
>> +        * the current folio in the read bio.
>> +        *
>> +        * Here we only need to compare the em->start/len against saved
>> +        * last_compress_em_start/len, as start/len inside an inode are unique,
>> +        * and compressed extent maps are never merged.
>> +        */
>> +       if (em->start != bio_ctrl->last_compress_em_start ||
>> +           em->len != bio_ctrl->last_compress_em_len)
>> +               return false;
> 
> Same here. In fact we already have part of this logic in the caller, see below.
> 
> 
>> +       return true;
>> +}
>> +
>>   /*
>>    * basic readpage implementation.  Locked extent state structs are inserted
>>    * into the tree that are removed when the IO is done (by the end_io
>> @@ -1080,9 +1122,19 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
>>                      *prev_em_start != em->start)
>>                          force_bio_submit = true;
>>
>> +               /*
>> +                * We must ensure we only merge compressed read when the current
>> +                * extent map matches the previous one exactly.
>> +                */
>> +               if (compress_type != BTRFS_COMPRESS_NONE) {
>> +                       if (!is_same_compressed_em(bio_ctrl, em))
>> +                               force_bio_submit = true;
>> +               }
> 
> We already do almost all of this above - we only miss the extent map
> length check.
> We have the prev_em_start which already stores the start offset of the
> last found compressed extent map, so we're duplicating code and logic
> here unnecessarily.
> 
> Further with this new logic, since last_compress_em_start and
> last_compress_em_len are initialized to zero, we always do an
> unnecessary split for the first readahead call that spans more than
> page/folio.
> We need to distinguish the first call and ignore it - that's why
> *prev_em_start is initialized to (u64)-1 and we check that special
> value above.
> 
> 
> 
>> +
>>                  if (prev_em_start)
>>                          *prev_em_start = em->start;
>>
>> +               save_compressed_em(bio_ctrl, em);
> 
> This could have been placed under the check for compress_type !=
> BTRFS_COMPRESS_NONE...
> 
> In other words, this could be simplified to this:
> 
> (also at https://pastebin.com/raw/tZHq7icd)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 0c12fd64a1f3..a982277f852b 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -131,6 +131,22 @@ struct btrfs_bio_ctrl {
>    */
>    unsigned long submit_bitmap;
>    struct readahead_control *ractl;
> +
> + /*
> + * The start/len of the last hit compressed extent map.
> + *
> + * The current btrfs_bio_is_contig() only uses disk_bytenr as
> + * the condition to check if the read can be merged with previous
> + * bio, which is not correct. E.g. two file extents pointing to the
> + * same extent.
> + *
> + * So here we need to do extra check to merge reads that are
> + * covered by the same extent map.
> + * Just extent_map::start/len will be enough, as they are unique
> + * inside the same inode.
> + */
> + u64 last_compress_em_start;
> + u64 last_compress_em_len;
>   };
> 
>   /*
> @@ -965,7 +981,7 @@ static void btrfs_readahead_expand(struct
> readahead_control *ractl,
>    * return 0 on success, otherwise return error
>    */
>   static int btrfs_do_readpage(struct folio *folio, struct extent_map
> **em_cached,
> -      struct btrfs_bio_ctrl *bio_ctrl, u64 *prev_em_start)
> +     struct btrfs_bio_ctrl *bio_ctrl)
>   {
>    struct inode *inode = folio->mapping->host;
>    struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> @@ -1075,13 +1091,15 @@ static int btrfs_do_readpage(struct folio
> *folio, struct extent_map **em_cached,
>    * is a corner case so we prioritize correctness over
>    * non-optimal behavior (submitting 2 bios for the same extent).
>    */
> - if (compress_type != BTRFS_COMPRESS_NONE &&
> -    prev_em_start && *prev_em_start != (u64)-1 &&
> -    *prev_em_start != em->start)
> - force_bio_submit = true;
> -
> - if (prev_em_start)
> - *prev_em_start = em->start;
> + if (compress_type != BTRFS_COMPRESS_NONE) {
> + if (bio_ctrl->last_compress_em_start != U64_MAX &&

Since we have last_compress_em_len and it's initialized to 0, we can
use that em_len instead of using U64_MAX for em_start, as we should
never hit an em with 0 length.
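
A minimal sketch of that alternative (my rewording of the hunk above,
untested): treat last_compress_em_len == 0 as "no compressed em seen
yet", so no U64_MAX initialization is needed:

	if (compress_type != BTRFS_COMPRESS_NONE) {
		/* A valid em never has len == 0, so 0 means none seen yet. */
		if (bio_ctrl->last_compress_em_len != 0 &&
		    (em->start != bio_ctrl->last_compress_em_start ||
		     em->len != bio_ctrl->last_compress_em_len))
			force_bio_submit = true;

		bio_ctrl->last_compress_em_start = em->start;
		bio_ctrl->last_compress_em_len = em->len;
	}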

> +    (em->start != bio_ctrl->last_compress_em_start ||
> +     em->len != bio_ctrl->last_compress_em_len))
> + force_bio_submit = true;
> +
> + bio_ctrl->last_compress_em_start = em->start;
> + bio_ctrl->last_compress_em_len = em->len;
> + }
> 
>    em_gen = em->generation;
>    btrfs_free_extent_map(em);
> @@ -1296,12 +1314,15 @@ int btrfs_read_folio(struct file *file, struct
> folio *folio)
>    const u64 start = folio_pos(folio);
>    const u64 end = start + folio_size(folio) - 1;
>    struct extent_state *cached_state = NULL;
> - struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
> + struct btrfs_bio_ctrl bio_ctrl = {
> + .opf = REQ_OP_READ,
> + .last_compress_em_start = U64_MAX,
> + };
>    struct extent_map *em_cached = NULL;
>    int ret;
> 
>    lock_extents_for_read(inode, start, end, &cached_state);
> - ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
> + ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
>    btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);
> 
>    btrfs_free_extent_map(em_cached);
> @@ -2641,7 +2662,8 @@ void btrfs_readahead(struct readahead_control *rac)
>   {
>    struct btrfs_bio_ctrl bio_ctrl = {
>    .opf = REQ_OP_READ | REQ_RAHEAD,
> - .ractl = rac
> + .ractl = rac,
> + .last_compress_em_start = U64_MAX,
>    };
>    struct folio *folio;
>    struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
> @@ -2649,12 +2671,11 @@ void btrfs_readahead(struct readahead_control *rac)
>    const u64 end = start + readahead_length(rac) - 1;
>    struct extent_state *cached_state = NULL;
>    struct extent_map *em_cached = NULL;
> - u64 prev_em_start = (u64)-1;
> 
>    lock_extents_for_read(inode, start, end, &cached_state);
> 
>    while ((folio = readahead_folio(rac)) != NULL)
> - btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
> + btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
> 
>    btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);

Thanks a lot for the simplified version, will update the patch with this 
change and a more simplified reproducer.

Thanks,
Qu

> 
> Thanks.
>>                  em_gen = em->generation;
>>                  btrfs_free_extent_map(em);
>>                  em = NULL;
>> --
>> 2.50.1
>>
>>


