* [PATCH v3] btrfs: do more strict compressed read merge check
From: Qu Wenruo @ 2025-08-26 9:56 UTC (permalink / raw)
To: linux-btrfs; +Cc: stable
[BUG]
With 64K page size (aarch64 with 64K page size config) and 4K btrfs
block size, the following workload can easily lead to a corrupted read:

mkfs.btrfs -f -s 4k $dev > /dev/null
mount -o compress=zstd $dev $mnt
xfs_io -f -c "pwrite -S 0xff 0 64k" -c sync $mnt/base > /dev/null
echo "correct result:"
od -Ax -x $mnt/base
xfs_io -f -c "reflink $mnt/base 32k 0 32k" \
       -c "reflink $mnt/base 0 32k 32k" \
       -c "pwrite -S 0xff 60k 4k" $mnt/new > /dev/null
echo "incorrect result:"
od -Ax -x $mnt/new
umount $mnt

This shows the following result:

correct result:
000000 ffff ffff ffff ffff ffff ffff ffff ffff
*
010000
incorrect result:
000000 ffff ffff ffff ffff ffff ffff ffff ffff
*
008000 0000 0000 0000 0000 0000 0000 0000 0000
*
00f000 ffff ffff ffff ffff ffff ffff ffff ffff
*
010000

Notice the zero in the range [0x8000, 0xf000), which is incorrect.

[CAUSE]
With extra trace printk, it shows the following events during od:
(some unrelated info removed like CPU and context)

od-3457 btrfs_do_readpage: enter r/i=5/258 folio=0(65536) prev_em_start=0000000000000000
The "r/i" is indicating the root and inode number. In our case the file
"new" is using ino 258 from fs tree (root 5).

Here notice the @prev_em_start pointer is NULL. This means the
btrfs_do_readpage() is called from btrfs_read_folio(), not from
btrfs_readahead().

od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=0 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=4096 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=8192 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=12288 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=16384 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=20480 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=24576 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=28672 got em start=0 len=32768

These above 32K blocks will be read from the the first half of the
compressed data extent.

od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=32768 got em start=32768 len=32768

Note here there is no btrfs_submit_compressed_read() call. Which is
incorrect now.
As both extent maps at 0 and 32K are pointing to the same compressed
data, their offset are different thus can not be merged into the same
read.

So this means the compressed data read merge check is doing something
wrong.

od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=36864 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=40960 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=45056 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=49152 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=53248 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=57344 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=61440 skip uptodate
od-3457 btrfs_submit_compressed_read: cb orig_bio: file off=0 len=61440

The function btrfs_submit_compressed_read() is only called at the end of
folio read. The compressed bio will only have a extent map of range [0,
32K), but the original bio passed in are for the whole 64K folio.

This will cause the decompression part to only fill the first 32K,
leaving the rest untouched (aka, filled with zero).

This incorrect compressed read merge leads to the above data corruption.

There are similar problems happened in the past, commit 808f80b46790
("Btrfs: update fix for read corruption of compressed and shared
extents") is doing pretty much the same fix for readahead.

But that's back to 2015, where btrfs still only supports bs (block size)
== ps (page size) cases.
This means btrfs_do_readpage() only needs to handle a folio which
contains exactly one block.

Only btrfs_readahead() can lead to a read covering multiple blocks.
Thus only btrfs_readahead() passes a non-NULL @prev_em_start pointer.

With the v5.15 btrfs introduced bs < ps support. This breaks the above
assumption that a folio can only contain one block.

Now btrfs_read_folio() can also read multiple blocks in one go.
But btrfs_read_folio() doesn't pass a @prev_em_start pointer, thus the
existing bio force submission check will never be triggered.

In theory, this can also happen for btrfs with large folios, but since
large folio is still experimental, we don't need to bother it, thus only
bs < ps support is affected for now.

[FIX]
Instead of passing @prev_em_start to do the proper compressed extent
check, introduce one new member, btrfs_bio_ctrl::last_em_start, so that
the existing bio force submission logic will always be triggered.

CC: stable@vger.kernel.org #5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
v3:
- Use a single btrfs_bio_ctrl::last_em_start
  Since the existing prev_em_start is using U64_MAX as the initial value
  to indicate no em hit yet, we can use the same logic, which saves
  another 8 bytes from btrfs_bio_ctrl.

- Update the reproducer
  Previously I failed to reproduce using a minimal workload.
  As regular read from od/md5sum always trigger readahead thus not
  hitting the @prev_em_start == NULL path.

  Will send out a fstest case for it using the minimal reproducer.
  But it will still need a 16K/64K page sized system to reproduce.

  Fix the problem by doing an block aligned write into the folio, so
  that the folio will be partially dirty and not go through the
  readahead path.

- Update the analyze
  This includes the trace events of the minimal reproducer, and
  mentioning of previous similar fixes and why they do not work for
  subpage cases.

- Update the CC tag
  Since it's only affecting bs < ps cases (for non-experimental builds),
  only need to fix kernels with subpage btrfs supports.

v2:
- Only save extent_map::start/len to save memory for btrfs_bio_ctrl
  It's using on-stack memory which is very limited inside the kernel.

- Remove the commit message mentioning of clearing last saved em
  Since we're using em::start/len, there is no need to clear them.
  Either we hit the same em::start/len, meaning hitting the same extent
  map, or we hit a different em, which will have a different start/len.
---
 fs/btrfs/extent_io.c | 40 ++++++++++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0c12fd64a1f3..b219dbaaedc8 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -131,6 +131,24 @@ struct btrfs_bio_ctrl {
*/
unsigned long submit_bitmap;
struct readahead_control *ractl;
+
+ /*
+ * The start file offset of the last hit extent map.
+ *
+ * This is for proper compressed read merge.
+ * U64_MAX means no extent map hit yet.
+ *
+ * The current btrfs_bio_is_contig() only uses disk_bytenr as
+ * the condition to check if the read can be merged with previous
+ * bio, which is not correct. E.g. two file extents pointing to the
+ * same extent but with different offset.
+ *
+ * So here we need to do extra check to only merge reads that are
+ * covered by the same extent map.
+ * Just extent_map::start will be enough, as they are unique
+ * inside the same inode.
+ */
+ u64 last_em_start;
};
/*
@@ -965,7 +983,7 @@ static void btrfs_readahead_expand(struct readahead_control *ractl,
* return 0 on success, otherwise return error
*/
static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
- struct btrfs_bio_ctrl *bio_ctrl, u64 *prev_em_start)
+ struct btrfs_bio_ctrl *bio_ctrl)
{
struct inode *inode = folio->mapping->host;
struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
@@ -1076,12 +1094,11 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
* non-optimal behavior (submitting 2 bios for the same extent).
*/
if (compress_type != BTRFS_COMPRESS_NONE &&
- prev_em_start && *prev_em_start != (u64)-1 &&
- *prev_em_start != em->start)
+ bio_ctrl->last_em_start != U64_MAX &&
+ bio_ctrl->last_em_start != em->start)
force_bio_submit = true;
- if (prev_em_start)
- *prev_em_start = em->start;
+ bio_ctrl->last_em_start = em->start;
em_gen = em->generation;
btrfs_free_extent_map(em);
@@ -1296,12 +1313,15 @@ int btrfs_read_folio(struct file *file, struct folio *folio)
const u64 start = folio_pos(folio);
const u64 end = start + folio_size(folio) - 1;
struct extent_state *cached_state = NULL;
- struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
+ struct btrfs_bio_ctrl bio_ctrl = {
+ .opf = REQ_OP_READ,
+ .last_em_start = U64_MAX,
+ };
struct extent_map *em_cached = NULL;
int ret;
lock_extents_for_read(inode, start, end, &cached_state);
- ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
+ ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);
btrfs_free_extent_map(em_cached);
@@ -2641,7 +2661,8 @@ void btrfs_readahead(struct readahead_control *rac)
{
struct btrfs_bio_ctrl bio_ctrl = {
.opf = REQ_OP_READ | REQ_RAHEAD,
- .ractl = rac
+ .ractl = rac,
+ .last_em_start = U64_MAX,
};
struct folio *folio;
struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
@@ -2649,12 +2670,11 @@ void btrfs_readahead(struct readahead_control *rac)
const u64 end = start + readahead_length(rac) - 1;
struct extent_state *cached_state = NULL;
struct extent_map *em_cached = NULL;
- u64 prev_em_start = (u64)-1;
lock_extents_for_read(inode, start, end, &cached_state);
while ((folio = readahead_folio(rac)) != NULL)
- btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
+ btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);
--
2.50.1
* Re: [PATCH v3] btrfs: do more strict compressed read merge check
From: Filipe Manana @ 2025-08-26 11:01 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs, stable
On Tue, Aug 26, 2025 at 11:34 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> With 64K page size (aarch64 with 64K page size config) and 4K btrfs
> block size, the following workload can easily lead to a corrupted read:
>
> mkfs.btrfs -f -s 4k $dev > /dev/null
> mount -o compress=zstd $dev $mnt
> xfs_io -f -c "pwrite -S 0xff 0 64k" -c sync $mnt/base > /dev/null

You can remove the sync, it's unnecessary as the reflinks will sync
the file. Leaving the sync is only distracting as it's not needed to trigger
the bug.

Keeping things minimal without unnecessary steps makes them easier to
understand and avoid any confusion.

You can also remove -o compress=zstd, and just leave -o compress, the
compression algorithm is irrelevant.

> echo "correct result:"
> od -Ax -x $mnt/base

od -A d -t x1 $mnt/base

That will print the offsets in decimal, far easier to read and
interpret rather than having to convert to decimal when reading...

> xfs_io -f -c "reflink $mnt/base 32k 0 32k" \
> -c "reflink $mnt/base 0 32k 32k" \
> -c "pwrite -S 0xff 60k 4k" $mnt/new > /dev/null
> echo "incorrect result:"
> od -Ax -x $mnt/new
> umount $mnt
>
> This shows the following result:
>
> correct result:
> 000000 ffff ffff ffff ffff ffff ffff ffff ffff
> *
> 010000
> incorrect result:
> 000000 ffff ffff ffff ffff ffff ffff ffff ffff
> *
> 008000 0000 0000 0000 0000 0000 0000 0000 0000

You see, reading 32768 is far easier to understand than reading
008000... Same goes for the other offsets in hexadecimal.

> *
> 00f000 ffff ffff ffff ffff ffff ffff ffff ffff
> *
> 010000
>
> Notice the zero in the range [0x8000, 0xf000), which is incorrect.
>
> [CAUSE]
> With extra trace printk, it shows the following events during od:
> (some unrelated info removed like CPU and context)
>
> od-3457 btrfs_do_readpage: enter r/i=5/258 folio=0(65536) prev_em_start=0000000000000000
>
> The "r/i" is indicating the root and inode number. In our case the file
> "new" is using ino 258 from fs tree (root 5).
>
> Here notice the @prev_em_start pointer is NULL. This means the
> btrfs_do_readpage() is called from btrfs_read_folio(), not from
> btrfs_readahead().
>
> od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=0 got em start=0 len=32768
> od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=4096 got em start=0 len=32768
> od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=8192 got em start=0 len=32768
> od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=12288 got em start=0 len=32768
> od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=16384 got em start=0 len=32768
> od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=20480 got em start=0 len=32768
> od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=24576 got em start=0 len=32768
> od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=28672 got em start=0 len=32768
>
> These above 32K blocks will be read from the the first half of the

the the -> the

> compressed data extent.
>
> od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=32768 got em start=32768 len=32768
>
> Note here there is no btrfs_submit_compressed_read() call. Which is
> incorrect now.
> As both extent maps at 0 and 32K are pointing to the same compressed
> data, their offset are different thus can not be merged into the same

offset -> offsets

> read.
>
> So this means the compressed data read merge check is doing something
> wrong.
>
> od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=36864 got em start=32768 len=32768
> od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=40960 got em start=32768 len=32768
> od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=45056 got em start=32768 len=32768
> od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=49152 got em start=32768 len=32768
> od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=53248 got em start=32768 len=32768
> od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=57344 got em start=32768 len=32768
> od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=61440 skip uptodate
> od-3457 btrfs_submit_compressed_read: cb orig_bio: file off=0 len=61440
>
> The function btrfs_submit_compressed_read() is only called at the end of
> folio read. The compressed bio will only have a extent map of range [0,

a extent -> an extent

> 32K), but the original bio passed in are for the whole 64K folio.

are -> is

>
> This will cause the decompression part to only fill the first 32K,
> leaving the rest untouched (aka, filled with zero).
>
> This incorrect compressed read merge leads to the above data corruption.
>
> There are similar problems happened in the past, commit 808f80b46790

There were similar problems that happened in the past (...)

> ("Btrfs: update fix for read corruption of compressed and shared
> extents") is doing pretty much the same fix for readahead.
>
> But that's back to 2015, where btrfs still only supports bs (block size)
> == ps (page size) cases.
> This means btrfs_do_readpage() only needs to handle a folio which
> contains exactly one block.
>
> Only btrfs_readahead() can lead to a read covering multiple blocks.
> Thus only btrfs_readahead() passes a non-NULL @prev_em_start pointer.
>
> With the v5.15 btrfs introduced bs < ps support. This breaks the above
> assumption that a folio can only contain one block.
>
> Now btrfs_read_folio() can also read multiple blocks in one go.
> But btrfs_read_folio() doesn't pass a @prev_em_start pointer, thus the
> existing bio force submission check will never be triggered.
>
> In theory, this can also happen for btrfs with large folios, but since
> large folio is still experimental, we don't need to bother it, thus only
> bs < ps support is affected for now.
>
> [FIX]
> Instead of passing @prev_em_start to do the proper compressed extent
> check, introduce one new member, btrfs_bio_ctrl::last_em_start, so that
> the existing bio force submission logic will always be triggered.
>
> CC: stable@vger.kernel.org #5.15+
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Changelog:
> v3:
> - Use a single btrfs_bio_ctrl::last_em_start
> Since the existing prev_em_start is using U64_MAX as the initial value
> to indicate no em hit yet, we can use the same logic, which saves
> another 8 bytes from btrfs_bio_ctrl.
>
> - Update the reproducer
> Previously I failed to reproduce using a minimal workload.
> As regular read from od/md5sum always trigger readahead thus not
> hitting the @prev_em_start == NULL path.
>
> Will send out a fstest case for it using the minimal reproducer.
> But it will still need a 16K/64K page sized system to reproduce.
>
> Fix the problem by doing an block aligned write into the folio, so
> that the folio will be partially dirty and not go through the
> readahead path.
>
> - Update the analyze
> This includes the trace events of the minimal reproducer, and
> mentioning of previous similar fixes and why they do not work for
> subpage cases.
>
> - Update the CC tag
> Since it's only affecting bs < ps cases (for non-experimental builds),
> only need to fix kernels with subpage btrfs supports.
>
> v2:
> - Only save extent_map::start/len to save memory for btrfs_bio_ctrl
> It's using on-stack memory which is very limited inside the kernel.
>
> - Remove the commit message mentioning of clearing last saved em
> Since we're using em::start/len, there is no need to clear them.
> Either we hit the same em::start/len, meaning hitting the same extent
> map, or we hit a different em, which will have a different start/len.
> ---
> fs/btrfs/extent_io.c | 40 ++++++++++++++++++++++++++++++----------
> 1 file changed, 30 insertions(+), 10 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 0c12fd64a1f3..b219dbaaedc8 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -131,6 +131,24 @@ struct btrfs_bio_ctrl {
> */
> unsigned long submit_bitmap;
> struct readahead_control *ractl;
> +
> + /*
> + * The start file offset of the last hit extent map.

A bit confusing, something more clear for example: The start offset
of the last used extent map by a read operation.

> + *
> + * This is for proper compressed read merge.
> + * U64_MAX means no extent map hit yet.

That sounds a bit confusing too.
Something clearer: U64_MAX means we are starting the read and have
made no progress yet.

> + *
> + * The current btrfs_bio_is_contig() only uses disk_bytenr as
> + * the condition to check if the read can be merged with previous
> + * bio, which is not correct. E.g. two file extents pointing to the
> + * same extent but with different offset.
> + *
> + * So here we need to do extra check to only merge reads that are

check -> checks

This now looks much better than before and without duplicating code.

With those small changes:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

> + * covered by the same extent map.
> + * Just extent_map::start will be enough, as they are unique
> + * inside the same inode.
> + */
> + u64 last_em_start;
> };
>
> /*
> @@ -965,7 +983,7 @@ static void btrfs_readahead_expand(struct readahead_control *ractl,
> * return 0 on success, otherwise return error
> */
> static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> - struct btrfs_bio_ctrl *bio_ctrl, u64 *prev_em_start)
> + struct btrfs_bio_ctrl *bio_ctrl)
> {
> struct inode *inode = folio->mapping->host;
> struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> @@ -1076,12 +1094,11 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> * non-optimal behavior (submitting 2 bios for the same extent).
> */
> if (compress_type != BTRFS_COMPRESS_NONE &&
> - prev_em_start && *prev_em_start != (u64)-1 &&
> - *prev_em_start != em->start)
> + bio_ctrl->last_em_start != U64_MAX &&
> + bio_ctrl->last_em_start != em->start)
> force_bio_submit = true;
>
> - if (prev_em_start)
> - *prev_em_start = em->start;
> + bio_ctrl->last_em_start = em->start;
>
> em_gen = em->generation;
> btrfs_free_extent_map(em);
> @@ -1296,12 +1313,15 @@ int btrfs_read_folio(struct file *file, struct folio *folio)
> const u64 start = folio_pos(folio);
> const u64 end = start + folio_size(folio) - 1;
> struct extent_state *cached_state = NULL;
> - struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
> + struct btrfs_bio_ctrl bio_ctrl = {
> + .opf = REQ_OP_READ,
> + .last_em_start = U64_MAX,
> + };
> struct extent_map *em_cached = NULL;
> int ret;
>
> lock_extents_for_read(inode, start, end, &cached_state);
> - ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
> + ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
> btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);
>
> btrfs_free_extent_map(em_cached);
> @@ -2641,7 +2661,8 @@ void btrfs_readahead(struct readahead_control *rac)
> {
> struct btrfs_bio_ctrl bio_ctrl = {
> .opf = REQ_OP_READ | REQ_RAHEAD,
> - .ractl = rac
> + .ractl = rac,
> + .last_em_start = U64_MAX,
> };
> struct folio *folio;
> struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
> @@ -2649,12 +2670,11 @@ void btrfs_readahead(struct readahead_control *rac)
> const u64 end = start + readahead_length(rac) - 1;
> struct extent_state *cached_state = NULL;
> struct extent_map *em_cached = NULL;
> - u64 prev_em_start = (u64)-1;
>
> lock_extents_for_read(inode, start, end, &cached_state);
>
> while ((folio = readahead_folio(rac)) != NULL)
> - btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
> + btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
>
> btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);
>
> --
> 2.50.1
>
>
* Re: [PATCH v3] btrfs: do more strict compressed read merge check
From: Filipe Manana @ 2025-08-26 11:13 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs, stable
On Tue, Aug 26, 2025 at 12:01 PM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Tue, Aug 26, 2025 at 11:34 AM Qu Wenruo <wqu@suse.com> wrote:
> >
> > [BUG]
> > With 64K page size (aarch64 with 64K page size config) and 4K btrfs
> > block size, the following workload can easily lead to a corrupted read:
> >
> > mkfs.btrfs -f -s 4k $dev > /dev/null
> > mount -o compress=zstd $dev $mnt
> > xfs_io -f -c "pwrite -S 0xff 0 64k" -c sync $mnt/base > /dev/null
>
> You can remove the sync, it's unnecessary as the reflinks will sync
> the file. Leaving the sync is only distracting as it's not needed to trigger
> the bug.
>
> Keeping things minimal without unnecessary steps makes them easier to
> understand and avoid any confusion.
>
> You can also remove -o compress=zstd, and just leave -o compress, the
> compression algorithm is irrelevant.
>
> > echo "correct result:"
> > od -Ax -x $mnt/base
>
> od -A d -t x1 $mnt/base
>
> That will print the offsets in decimal, far easier to read and
> interpret rather than having to convert to decimal when reading...
>
>
> > xfs_io -f -c "reflink $mnt/base 32k 0 32k" \
> > -c "reflink $mnt/base 0 32k 32k" \
> > -c "pwrite -S 0xff 60k 4k" $mnt/new > /dev/null
> > echo "incorrect result:"
> > od -Ax -x $mnt/new
> > umount $mnt
> >
> > This shows the following result:
> >
> > correct result:
> > 000000 ffff ffff ffff ffff ffff ffff ffff ffff
> > *
> > 010000
> > incorrect result:
> > 000000 ffff ffff ffff ffff ffff ffff ffff ffff
> > *
> > 008000 0000 0000 0000 0000 0000 0000 0000 0000
>
> You see, reading 32768 is far easier to understand than reading
> 008000... Same goes for the other offsets in hexadecimal.
>
> > *
> > 00f000 ffff ffff ffff ffff ffff ffff ffff ffff
> > *
> > 010000
> >
> > Notice the zero in the range [0x8000, 0xf000), which is incorrect.
> >
> > [CAUSE]
> > With extra trace printk, it shows the following events during od:
> > (some unrelated info removed like CPU and context)
> >
> > od-3457 btrfs_do_readpage: enter r/i=5/258 folio=0(65536) prev_em_start=0000000000000000
> >
> > The "r/i" is indicating the root and inode number. In our case the file
> > "new" is using ino 258 from fs tree (root 5).
> >
> > Here notice the @prev_em_start pointer is NULL. This means the
> > btrfs_do_readpage() is called from btrfs_read_folio(), not from
> > btrfs_readahead().
> >
> > od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=0 got em start=0 len=32768
> > od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=4096 got em start=0 len=32768
> > od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=8192 got em start=0 len=32768
> > od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=12288 got em start=0 len=32768
> > od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=16384 got em start=0 len=32768
> > od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=20480 got em start=0 len=32768
> > od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=24576 got em start=0 len=32768
> > od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=28672 got em start=0 len=32768
> >
> > These above 32K blocks will be read from the the first half of the
>
> the the -> the
>
> > compressed data extent.
> >
> > od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=32768 got em start=32768 len=32768
> >
> > Note here there is no btrfs_submit_compressed_read() call. Which is
> > incorrect now.
> > As both extent maps at 0 and 32K are pointing to the same compressed
> > data, their offset are different thus can not be merged into the same
>
> offset -> offsets
>
> > read.
> >
> > So this means the compressed data read merge check is doing something
> > wrong.
> >
> > od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=36864 got em start=32768 len=32768
> > od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=40960 got em start=32768 len=32768
> > od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=45056 got em start=32768 len=32768
> > od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=49152 got em start=32768 len=32768
> > od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=53248 got em start=32768 len=32768
> > od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=57344 got em start=32768 len=32768
> > od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=61440 skip uptodate
> > od-3457 btrfs_submit_compressed_read: cb orig_bio: file off=0 len=61440
> >
> > The function btrfs_submit_compressed_read() is only called at the end of
> > folio read. The compressed bio will only have a extent map of range [0,
>
> a extent -> an extent
>
> > 32K), but the original bio passed in are for the whole 64K folio.
>
> are -> is
>
> >
> > This will cause the decompression part to only fill the first 32K,
> > leaving the rest untouched (aka, filled with zero).
> >
> > This incorrect compressed read merge leads to the above data corruption.
> >
> > There are similar problems happened in the past, commit 808f80b46790
>
> There were similar problems that happened in the past (...)
>
> > ("Btrfs: update fix for read corruption of compressed and shared
> > extents") is doing pretty much the same fix for readahead.
> >
> > But that's back to 2015, where btrfs still only supports bs (block size)
> > == ps (page size) cases.
> > This means btrfs_do_readpage() only needs to handle a folio which
> > contains exactly one block.
> >
> > Only btrfs_readahead() can lead to a read covering multiple blocks.
> > Thus only btrfs_readahead() passes a non-NULL @prev_em_start pointer.
> >
> > With the v5.15 btrfs introduced bs < ps support. This breaks the above
> > assumption that a folio can only contain one block.
> >
> > Now btrfs_read_folio() can also read multiple blocks in one go.
> > But btrfs_read_folio() doesn't pass a @prev_em_start pointer, thus the
> > existing bio force submission check will never be triggered.
> >
> > In theory, this can also happen for btrfs with large folios, but since
> > large folio is still experimental, we don't need to bother it, thus only
> > bs < ps support is affected for now.
> >
> > [FIX]
> > Instead of passing @prev_em_start to do the proper compressed extent
> > check, introduce one new member, btrfs_bio_ctrl::last_em_start, so that
> > the existing bio force submission logic will always be triggered.
> >
> > CC: stable@vger.kernel.org #5.15+
> > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > ---
> > Changelog:
> > v3:
> > - Use a single btrfs_bio_ctrl::last_em_start
> > Since the existing prev_em_start is using U64_MAX as the initial value
> > to indicate no em hit yet, we can use the same logic, which saves
> > another 8 bytes from btrfs_bio_ctrl.
> >
> > - Update the reproducer
> > Previously I failed to reproduce using a minimal workload.
> > As regular read from od/md5sum always trigger readahead thus not
> > hitting the @prev_em_start == NULL path.
> >
> > Will send out a fstest case for it using the minimal reproducer.
> > But it will still need a 16K/64K page sized system to reproduce.
> >
> > Fix the problem by doing an block aligned write into the folio, so
> > that the folio will be partially dirty and not go through the
> > readahead path.
> >
> > - Update the analyze
> > This includes the trace events of the minimal reproducer, and
> > mentioning of previous similar fixes and why they do not work for
> > subpage cases.
> >
> > - Update the CC tag
> > Since it's only affecting bs < ps cases (for non-experimental builds),
> > only need to fix kernels with subpage btrfs supports.
> >
> > v2:
> > - Only save extent_map::start/len to save memory for btrfs_bio_ctrl
> > It's using on-stack memory which is very limited inside the kernel.
> >
> > - Remove the commit message mentioning of clearing last saved em
> > Since we're using em::start/len, there is no need to clear them.
> > Either we hit the same em::start/len, meaning hitting the same extent
> > map, or we hit a different em, which will have a different start/len.
> > ---
> > fs/btrfs/extent_io.c | 40 ++++++++++++++++++++++++++++++----------
> > 1 file changed, 30 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 0c12fd64a1f3..b219dbaaedc8 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -131,6 +131,24 @@ struct btrfs_bio_ctrl {
> > */
> > unsigned long submit_bitmap;
> > struct readahead_control *ractl;
> > +
> > + /*
> > + * The start file offset of the last hit extent map.
>
> A bit confusing, something more clear for example: The start offset
> of the last used extent map by a read operation.
>
> > + *
> > + * This is for proper compressed read merge.
> > + * U64_MAX means no extent map hit yet.
>
> That sounds a bit confusing too.
> Something clearer: U64_MAX means we are starting the read and have
> made no progress yet.
>
> > + *
> > + * The current btrfs_bio_is_contig() only uses disk_bytenr as
> > + * the condition to check if the read can be merged with previous
> > + * bio, which is not correct. E.g. two file extents pointing to the
> > + * same extent but with different offset.
> > + *
> > + * So here we need to do extra check to only merge reads that are
>
> check -> checks
>
> This now looks much better than before and without duplicating code.
>
> With those small changes:
>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>

We should also get a better subject.
The current one doesn't really indicate we are fixing a bug.

Something like:

btrfs: fix corruption reading compressed range when block size is
smaller than page size

And don't forget to update the test case too with any new subject.

Thanks.

>
>
> > + * covered by the same extent map.
> > + * Just extent_map::start will be enough, as they are unique
> > + * inside the same inode.
> > + */
> > + u64 last_em_start;
> > };
> >
> > /*
> > @@ -965,7 +983,7 @@ static void btrfs_readahead_expand(struct readahead_control *ractl,
> > * return 0 on success, otherwise return error
> > */
> > static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> > - struct btrfs_bio_ctrl *bio_ctrl, u64 *prev_em_start)
> > + struct btrfs_bio_ctrl *bio_ctrl)
> > {
> > struct inode *inode = folio->mapping->host;
> > struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> > @@ -1076,12 +1094,11 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> > * non-optimal behavior (submitting 2 bios for the same extent).
> > */
> > if (compress_type != BTRFS_COMPRESS_NONE &&
> > - prev_em_start && *prev_em_start != (u64)-1 &&
> > - *prev_em_start != em->start)
> > + bio_ctrl->last_em_start != U64_MAX &&
> > + bio_ctrl->last_em_start != em->start)
> > force_bio_submit = true;
> >
> > - if (prev_em_start)
> > - *prev_em_start = em->start;
> > + bio_ctrl->last_em_start = em->start;
> >
> > em_gen = em->generation;
> > btrfs_free_extent_map(em);
> > @@ -1296,12 +1313,15 @@ int btrfs_read_folio(struct file *file, struct folio *folio)
> > const u64 start = folio_pos(folio);
> > const u64 end = start + folio_size(folio) - 1;
> > struct extent_state *cached_state = NULL;
> > - struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
> > + struct btrfs_bio_ctrl bio_ctrl = {
> > + .opf = REQ_OP_READ,
> > + .last_em_start = U64_MAX,
> > + };
> > struct extent_map *em_cached = NULL;
> > int ret;
> >
> > lock_extents_for_read(inode, start, end, &cached_state);
> > - ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
> > + ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
> > btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);
> >
> > btrfs_free_extent_map(em_cached);
> > @@ -2641,7 +2661,8 @@ void btrfs_readahead(struct readahead_control *rac)
> > {
> > struct btrfs_bio_ctrl bio_ctrl = {
> > .opf = REQ_OP_READ | REQ_RAHEAD,
> > - .ractl = rac
> > + .ractl = rac,
> > + .last_em_start = U64_MAX,
> > };
> > struct folio *folio;
> > struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
> > @@ -2649,12 +2670,11 @@ void btrfs_readahead(struct readahead_control *rac)
> > const u64 end = start + readahead_length(rac) - 1;
> > struct extent_state *cached_state = NULL;
> > struct extent_map *em_cached = NULL;
> > - u64 prev_em_start = (u64)-1;
> >
> > lock_extents_for_read(inode, start, end, &cached_state);
> >
> > while ((folio = readahead_folio(rac)) != NULL)
> > - btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
> > + btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
> >
> > btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);
> >
> > --
> > 2.50.1
> >
> >