* [PATCH v2 0/4] btrfs: allow creating inline data extents for sector size < page size case
@ 2025-02-15 8:34 Qu Wenruo
2025-02-15 8:34 ` [PATCH v2 1/4] btrfs: fix inline data extent reads which zero out the remaining part Qu Wenruo
` (4 more replies)
0 siblings, 5 replies; 12+ messages in thread
From: Qu Wenruo @ 2025-02-15 8:34 UTC (permalink / raw)
To: linux-btrfs
[CHANGELOG]
v2:
- Add the previous inline read fix into the series
- Add a patch to remove the subpage experimental warning message
  The main reason for the warning was the lack of certain features,
  but that is no longer the case.
For btrfs block size < page size (subpage), there used to be a list of
features that were not supported:
- RAID56
Added in v5.19
- Block perfect compressed write
Added in v6.13; previously only page-aligned ranges could go through
the compressed write path.
- Inline data extent creation
Now the only missing feature is inline data extent creation.
All the technical blockers were resolved in v6.13, so it's time to
allow subpage btrfs to create inline data extents.
The first patch is to fix a bug that can only be triggered with recent
partial uptodate folio support.
The second patch fixes a minor issue for qgroup accounting for inlined
data extents.
The third patch enables inline data extent creation for subpage btrfs.
And the final patch removes the experimental warning message for
subpage btrfs.
Qu Wenruo (4):
btrfs: fix inline data extent reads which zero out the remaining part
btrfs: fix the qgroup data free range for inline data extents
btrfs: allow inline data extents creation if sector size < page size
btrfs: remove the subpage related warning message
fs/btrfs/disk-io.c | 5 -----
fs/btrfs/inode.c | 30 ++++++++++--------------------
2 files changed, 10 insertions(+), 25 deletions(-)
--
2.48.1
^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/4] btrfs: fix inline data extent reads which zero out the remaining part
  2025-02-15  8:34 [PATCH v2 0/4] btrfs: allow creating inline data extents for sector size < page size case Qu Wenruo
@ 2025-02-15  8:34 ` Qu Wenruo
  2025-02-21 12:32   ` Filipe Manana
  2025-02-15  8:34 ` [PATCH v2 2/4] btrfs: fix the qgroup data free range for inline data extents Qu Wenruo
  ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Qu Wenruo @ 2025-02-15  8:34 UTC (permalink / raw)
  To: linux-btrfs

[BUG in DEVEL BRANCH]
This bug itself can only be reproduced with the following out-of-tree
patches:

  btrfs: allow inline data extents creation if sector size < page size
  btrfs: allow buffered write to skip full page if it's sector aligned

With those out-of-tree patches, we can hit a data corruption:

  # mkfs.btrfs -f -s 4k $dev
  # mount $dev $mnt -o compress=zstd
  # xfs_io -f -c "pwrite 0 4k" $mnt/foobar
  # sync
  # echo 3 > /proc/sys/vm/drop_caches
  # xfs_io -f -c "pwrite 8k 4k" $mnt/foobar
  # md5sum $mnt/foobar
  65df683add4707de8200bad14745b9ec

Meanwhile such a workload should result in an md5sum of
277f3840b275c74d01e979ea9d75ac19

[CAUSE]
The first buffered write into range [0, 4K) will result in a compressed
inline extent (this relies on the patch "btrfs: allow inline data
extents creation if sector size < page size" to create such an inline
extent):

  item 6 key (257 EXTENT_DATA 0) itemoff 15823 itemsize 40
    generation 9 type 0 (inline)
    inline extent data size 19 ram_bytes 4096 compression 3 (zstd)

Then all page cache is dropped, and we do the new write into range
[8K, 12K).

With the out-of-tree patch "btrfs: allow buffered write to skip full
page if it's sector aligned", such an aligned write won't trigger the
full folio read, so we just dirtied range [8K, 12K), and range [0, 4K)
is not uptodate.

Then md5sum triggered the full folio read, causing us to read the
inlined data extent.

Inside the functions read_inline_extent() and uncompress_inline(), we
zero out the entire remaining part of the folio, including the newly
dirtied range [8K, 12K), leading to the corruption.

[FIX]
Thankfully the bug has not yet reached any end users.

For the upstream kernel, the [8K, 12K) write itself will trigger the
full folio read before doing any write, thus everything is still fine.

Furthermore, existing btrfs users with sector size < page size (the
most common one being Asahi Linux) with inline data extents created
from x86_64 are also fine, because two factors are saving us:

- Inline extents are always at offset 0

- Folio reads always follow the file offset order

So we always read out the inline extent first, zeroing the remaining
folio (which has no contents yet), then read the next sector, copying
the correct content over the zeroed out part.
No end users are affected, thankfully.

The fix is to make both read_inline_extent() and uncompress_inline()
zero out only the rest of the sector, not the whole page.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2620c554917f..ea60123a28a2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6780,6 +6780,7 @@ static noinline int uncompress_inline(struct btrfs_path *path,
 {
 	int ret;
 	struct extent_buffer *leaf = path->nodes[0];
+	const u32 sectorsize = leaf->fs_info->sectorsize;
 	char *tmp;
 	size_t max_size;
 	unsigned long inline_size;
@@ -6808,17 +6809,19 @@ static noinline int uncompress_inline(struct btrfs_path *path,
 	 * cover that region here.
 	 */

-	if (max_size < PAGE_SIZE)
-		folio_zero_range(folio, max_size, PAGE_SIZE - max_size);
+	if (max_size < sectorsize)
+		folio_zero_range(folio, max_size, sectorsize - max_size);
 	kfree(tmp);
 	return ret;
 }

-static int read_inline_extent(struct btrfs_path *path, struct folio *folio)
+static int read_inline_extent(struct btrfs_fs_info *fs_info,
+			      struct btrfs_path *path, struct folio *folio)
 {
 	struct btrfs_file_extent_item *fi;
 	void *kaddr;
 	size_t copy_size;
+	const u32 sectorsize = fs_info->sectorsize;

 	if (!folio || folio_test_uptodate(folio))
 		return 0;
@@ -6836,8 +6839,8 @@ static int read_inline_extent(struct btrfs_path *path, struct folio *folio)
 	read_extent_buffer(path->nodes[0], kaddr,
 			   btrfs_file_extent_inline_start(fi), copy_size);
 	kunmap_local(kaddr);
-	if (copy_size < PAGE_SIZE)
-		folio_zero_range(folio, copy_size, PAGE_SIZE - copy_size);
+	if (copy_size < sectorsize)
+		folio_zero_range(folio, copy_size, sectorsize - copy_size);
 	return 0;
 }

@@ -7012,7 +7015,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 	ASSERT(em->disk_bytenr == EXTENT_MAP_INLINE);
 	ASSERT(em->len == fs_info->sectorsize);

-	ret = read_inline_extent(path, folio);
+	ret = read_inline_extent(fs_info, path, folio);
 	if (ret < 0)
 		goto out;
 	goto insert;
--
2.48.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: [PATCH v2 1/4] btrfs: fix inline data extent reads which zero out the remaining part
  2025-02-15  8:34 ` [PATCH v2 1/4] btrfs: fix inline data extent reads which zero out the remaining part Qu Wenruo
@ 2025-02-21 12:32   ` Filipe Manana
  2025-02-21 22:39     ` Qu Wenruo
  0 siblings, 1 reply; 12+ messages in thread
From: Filipe Manana @ 2025-02-21 12:32 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sat, Feb 15, 2025 at 8:34 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG in DEVEL BRANCH]
> This bug itself can only be reproduced with the following out-of-tree
> patches:
>
>   btrfs: allow inline data extents creation if sector size < page size

At least this patch could be part of this patchset, no? It seems related.
It makes it harder to review with unmerged dependencies.

>   btrfs: allow buffered write to skip full page if it's sector aligned
>
> With those out-of-tree patches, we can hit a data corruption:
>
>   # mkfs.btrfs -f -s 4k $dev
>   # mount $dev $mnt -o compress=zstd
>   # xfs_io -f -c "pwrite 0 4k" $mnt/foobar
>   # sync
>   # echo 3 > /proc/sys/vm/drop_caches
>   # xfs_io -f -c "pwrite 8k 4k" $mnt/foobar
>   # md5sum $mnt/foobar
>   65df683add4707de8200bad14745b9ec
>
> Meanwhile such a workload should result in an md5sum of
> 277f3840b275c74d01e979ea9d75ac19

So this is hard for us human beings to understand - we don't compute
checksums in our heads.
So something easier:

  # mkfs.btrfs -f -s 4k $dev
  # mount $dev $mnt -o compress=zstd
  # xfs_io -f -c "pwrite -S 0xab 0 4k" $mnt/foobar
  # sync
  # echo 3 > /proc/sys/vm/drop_caches
  # xfs_io -f -c "pwrite -S 0xcd 8k 4k" $mnt/foobar
  # od -A d -t x1 $mnt/foobar

Now display the result of od, which is easy to understand and
summarises repeated patterns.
It should be obvious here when the expected pattern isn't observed:
byte range 0 to 4095 should be full of 0xab, and 8192 to 12K full of
0xcd.

See, a lot easier for anyone to understand than comparing checksums.

>
> [CAUSE]
> The first buffered write into range [0, 4K) will result in a compressed
> inline extent (this relies on the patch "btrfs: allow inline data
> extents creation if sector size < page size" to create such an inline
> extent):
>
>   item 6 key (257 EXTENT_DATA 0) itemoff 15823 itemsize 40
>     generation 9 type 0 (inline)
>     inline extent data size 19 ram_bytes 4096 compression 3 (zstd)
>
> Then all page cache is dropped, and we do the new write into range
> [8K, 12K).
>
> With the out-of-tree patch "btrfs: allow buffered write to skip full
> page if it's sector aligned", such an aligned write won't trigger the
> full folio read, so we just dirtied range [8K, 12K), and range [0, 4K)
> is not uptodate.

And without that out-of-tree patch, do we have any problem?
If that patch creates a problem then perhaps fix it before it is
merged, or at least include it early in this patchset?

See, at least for me it's confusing to have patches saying they fix a
problem that happens when some other unmerged patch is applied.
It raises the doubt whether that other patch should be fixed instead,
making this one not needed, or at least whether they should be in the
same patchset.

It's not clear here if this patch serves any purpose other than
fixing that other out-of-tree patch.

The patch itself looks fine, but all these references to other
unmerged patches, and that they cause problems, make it hard and
confusing to review.
Sorry, it's just hard to follow these patchsets that depend on other
not yet merged patchsets :(

Thanks.

>
> Then md5sum triggered the full folio read, causing us to read the
> inlined data extent.
>
> Inside the functions read_inline_extent() and uncompress_inline(), we
> zero out the entire remaining part of the folio, including the newly
> dirtied range [8K, 12K), leading to the corruption.
>
> [FIX]
> Thankfully the bug has not yet reached any end users.
> For the upstream kernel, the [8K, 12K) write itself will trigger the
> full folio read before doing any write, thus everything is still fine.
>
> Furthermore, existing btrfs users with sector size < page size (the
> most common one being Asahi Linux) with inline data extents created
> from x86_64 are also fine, because two factors are saving us:
>
> - Inline extents are always at offset 0
>
> - Folio reads always follow the file offset order
>
> So we always read out the inline extent first, zeroing the remaining
> folio (which has no contents yet), then read the next sector, copying
> the correct content over the zeroed out part.
> No end users are affected, thankfully.
>
> The fix is to make both read_inline_extent() and uncompress_inline()
> zero out only the rest of the sector, not the whole page.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/inode.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 2620c554917f..ea60123a28a2 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6780,6 +6780,7 @@ static noinline int uncompress_inline(struct btrfs_path *path,
>  {
>  	int ret;
>  	struct extent_buffer *leaf = path->nodes[0];
> +	const u32 sectorsize = leaf->fs_info->sectorsize;
>  	char *tmp;
>  	size_t max_size;
>  	unsigned long inline_size;
> @@ -6808,17 +6809,19 @@ static noinline int uncompress_inline(struct btrfs_path *path,
>  	 * cover that region here.
>  	 */
>
> -	if (max_size < PAGE_SIZE)
> -		folio_zero_range(folio, max_size, PAGE_SIZE - max_size);
> +	if (max_size < sectorsize)
> +		folio_zero_range(folio, max_size, sectorsize - max_size);
>  	kfree(tmp);
>  	return ret;
>  }
>
> -static int read_inline_extent(struct btrfs_path *path, struct folio *folio)
> +static int read_inline_extent(struct btrfs_fs_info *fs_info,
> +			      struct btrfs_path *path, struct folio *folio)
>  {
>  	struct btrfs_file_extent_item *fi;
>  	void *kaddr;
>  	size_t copy_size;
> +	const u32 sectorsize = fs_info->sectorsize;
>
>  	if (!folio || folio_test_uptodate(folio))
>  		return 0;
> @@ -6836,8 +6839,8 @@ static int read_inline_extent(struct btrfs_path *path, struct folio *folio)
>  	read_extent_buffer(path->nodes[0], kaddr,
>  			   btrfs_file_extent_inline_start(fi), copy_size);
>  	kunmap_local(kaddr);
> -	if (copy_size < PAGE_SIZE)
> -		folio_zero_range(folio, copy_size, PAGE_SIZE - copy_size);
> +	if (copy_size < sectorsize)
> +		folio_zero_range(folio, copy_size, sectorsize - copy_size);
>  	return 0;
>  }
>
> @@ -7012,7 +7015,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>  	ASSERT(em->disk_bytenr == EXTENT_MAP_INLINE);
>  	ASSERT(em->len == fs_info->sectorsize);
>
> -	ret = read_inline_extent(path, folio);
> +	ret = read_inline_extent(fs_info, path, folio);
>  	if (ret < 0)
>  		goto out;
>  	goto insert;
> --
> 2.48.1
>

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2 1/4] btrfs: fix inline data extent reads which zero out the remaining part
  2025-02-21 12:32   ` Filipe Manana
@ 2025-02-21 22:39     ` Qu Wenruo
  2025-02-24 16:47       ` Filipe Manana
  2025-02-27 14:36       ` Daniel Vacek
  0 siblings, 2 replies; 12+ messages in thread
From: Qu Wenruo @ 2025-02-21 22:39 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs

On 2025/2/21 23:02, Filipe Manana wrote:
> On Sat, Feb 15, 2025 at 8:34 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BUG in DEVEL BRANCH]
>> This bug itself can only be reproduced with the following out-of-tree
>> patches:
>>
>>   btrfs: allow inline data extents creation if sector size < page size
>
> At least this patch could be part of this patchset, no? It seems related.
> It makes it harder to review with unmerged dependencies.

Sure, I can merge all those patches into a much larger series.

>
>>   btrfs: allow buffered write to skip full page if it's sector aligned
>>
>> With those out-of-tree patches, we can hit a data corruption:
>>
>>   # mkfs.btrfs -f -s 4k $dev
>>   # mount $dev $mnt -o compress=zstd
>>   # xfs_io -f -c "pwrite 0 4k" $mnt/foobar
>>   # sync
>>   # echo 3 > /proc/sys/vm/drop_caches
>>   # xfs_io -f -c "pwrite 8k 4k" $mnt/foobar
>>   # md5sum $mnt/foobar
>>   65df683add4707de8200bad14745b9ec
>>
>> Meanwhile such a workload should result in an md5sum of
>> 277f3840b275c74d01e979ea9d75ac19
>
> So this is hard for us human beings to understand - we don't compute
> checksums in our heads.
> So something easier:
>
>   # mkfs.btrfs -f -s 4k $dev
>   # mount $dev $mnt -o compress=zstd
>   # xfs_io -f -c "pwrite -S 0xab 0 4k" $mnt/foobar
>   # sync
>   # echo 3 > /proc/sys/vm/drop_caches
>   # xfs_io -f -c "pwrite -S 0xcd 8k 4k" $mnt/foobar
>   # od -A d -t x1 $mnt/foobar
>
> Now display the result of od, which is easy to understand and
> summarises repeated patterns.
> It should be obvious here when the expected pattern isn't observed:
> byte range 0 to 4095 should be full of 0xab, and 8192 to 12K full of
> 0xcd.
>
> See, a lot easier for anyone to understand than comparing checksums.

Thanks a lot, will use that reproducer in the commit message.

>
>>
>> [CAUSE]
>> The first buffered write into range [0, 4K) will result in a compressed
>> inline extent (this relies on the patch "btrfs: allow inline data
>> extents creation if sector size < page size" to create such an inline
>> extent):
>>
>>   item 6 key (257 EXTENT_DATA 0) itemoff 15823 itemsize 40
>>     generation 9 type 0 (inline)
>>     inline extent data size 19 ram_bytes 4096 compression 3 (zstd)
>>
>> Then all page cache is dropped, and we do the new write into range
>> [8K, 12K).
>>
>> With the out-of-tree patch "btrfs: allow buffered write to skip full
>> page if it's sector aligned", such an aligned write won't trigger the
>> full folio read, so we just dirtied range [8K, 12K), and range [0, 4K)
>> is not uptodate.
>
> And without that out-of-tree patch, do we have any problem?

Fortunately no.

> If that patch creates a problem then perhaps fix it before it is
> merged, or at least include it early in this patchset?

I'll put this one before that patch in the merged series.

>
> See, at least for me it's confusing to have patches saying they fix a
> problem that happens when some other unmerged patch is applied.
> It raises the doubt whether that other patch should be fixed instead,
> making this one not needed, or at least whether they should be in the
> same patchset.

But I'm wondering, what's the proper way to handle such cases?

Is putting this one before that patch enough?

I really want to express some strong dependency, as if someone
backported that partial uptodate folio support, it would easily screw
up a lot of things.

On the other hand, it's also very hard to explain why this patch itself
is needed without mentioning the future behavior change.

Any good suggestions? Especially since I'm pretty sure such a pattern
will happen again and again as we're approaching larger data folio
support.

Thanks,
Qu

>
> It's not clear here if this patch serves any purpose other than
> fixing that other out-of-tree patch.
>
> The patch itself looks fine, but all these references to other
> unmerged patches, and that they cause problems, make it hard and
> confusing to review.
> Sorry, it's just hard to follow these patchsets that depend on other
> not yet merged patchsets :(
>
> Thanks.
>
>>
>> Then md5sum triggered the full folio read, causing us to read the
>> inlined data extent.
>>
>> Inside the functions read_inline_extent() and uncompress_inline(), we
>> zero out the entire remaining part of the folio, including the newly
>> dirtied range [8K, 12K), leading to the corruption.
>>
>> [FIX]
>> Thankfully the bug has not yet reached any end users.
>> For the upstream kernel, the [8K, 12K) write itself will trigger the
>> full folio read before doing any write, thus everything is still fine.
>>
>> Furthermore, existing btrfs users with sector size < page size (the
>> most common one being Asahi Linux) with inline data extents created
>> from x86_64 are also fine, because two factors are saving us:
>>
>> - Inline extents are always at offset 0
>>
>> - Folio reads always follow the file offset order
>>
>> So we always read out the inline extent first, zeroing the remaining
>> folio (which has no contents yet), then read the next sector, copying
>> the correct content over the zeroed out part.
>> No end users are affected, thankfully.
>>
>> The fix is to make both read_inline_extent() and uncompress_inline()
>> zero out only the rest of the sector, not the whole page.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/inode.c | 15 +++++++++------
>>  1 file changed, 9 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 2620c554917f..ea60123a28a2 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -6780,6 +6780,7 @@ static noinline int uncompress_inline(struct btrfs_path *path,
>>  {
>>  	int ret;
>>  	struct extent_buffer *leaf = path->nodes[0];
>> +	const u32 sectorsize = leaf->fs_info->sectorsize;
>>  	char *tmp;
>>  	size_t max_size;
>>  	unsigned long inline_size;
>> @@ -6808,17 +6809,19 @@ static noinline int uncompress_inline(struct btrfs_path *path,
>>  	 * cover that region here.
>>  	 */
>>
>> -	if (max_size < PAGE_SIZE)
>> -		folio_zero_range(folio, max_size, PAGE_SIZE - max_size);
>> +	if (max_size < sectorsize)
>> +		folio_zero_range(folio, max_size, sectorsize - max_size);
>>  	kfree(tmp);
>>  	return ret;
>>  }
>>
>> -static int read_inline_extent(struct btrfs_path *path, struct folio *folio)
>> +static int read_inline_extent(struct btrfs_fs_info *fs_info,
>> +			      struct btrfs_path *path, struct folio *folio)
>>  {
>>  	struct btrfs_file_extent_item *fi;
>>  	void *kaddr;
>>  	size_t copy_size;
>> +	const u32 sectorsize = fs_info->sectorsize;
>>
>>  	if (!folio || folio_test_uptodate(folio))
>>  		return 0;
>> @@ -6836,8 +6839,8 @@ static int read_inline_extent(struct btrfs_path *path, struct folio *folio)
>>  	read_extent_buffer(path->nodes[0], kaddr,
>>  			   btrfs_file_extent_inline_start(fi), copy_size);
>>  	kunmap_local(kaddr);
>> -	if (copy_size < PAGE_SIZE)
>> -		folio_zero_range(folio, copy_size, PAGE_SIZE - copy_size);
>> +	if (copy_size < sectorsize)
>> +		folio_zero_range(folio, copy_size, sectorsize - copy_size);
>>  	return 0;
>>  }
>>
>> @@ -7012,7 +7015,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>>  	ASSERT(em->disk_bytenr == EXTENT_MAP_INLINE);
>>  	ASSERT(em->len == fs_info->sectorsize);
>>
>> -	ret = read_inline_extent(path, folio);
>> +	ret = read_inline_extent(fs_info, path, folio);
>>  	if (ret < 0)
>>  		goto out;
>>  	goto insert;
>> --
>> 2.48.1
>>
>>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2 1/4] btrfs: fix inline data extent reads which zero out the remaining part
  2025-02-21 22:39     ` Qu Wenruo
@ 2025-02-24 16:47       ` Filipe Manana
  2025-02-27 14:36       ` Daniel Vacek
  1 sibling, 0 replies; 12+ messages in thread
From: Filipe Manana @ 2025-02-24 16:47 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On Fri, Feb 21, 2025 at 10:39 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
> On 2025/2/21 23:02, Filipe Manana wrote:
> > On Sat, Feb 15, 2025 at 8:34 AM Qu Wenruo <wqu@suse.com> wrote:
> >>
> >> [BUG in DEVEL BRANCH]
> >> This bug itself can only be reproduced with the following out-of-tree
> >> patches:
> >>
> >>   btrfs: allow inline data extents creation if sector size < page size
> >
> > At least this patch could be part of this patchset, no? It seems related.
> > It makes it harder to review with unmerged dependencies.
>
> Sure, I can merge all those patches into a much larger series.
>
> >
> >>   btrfs: allow buffered write to skip full page if it's sector aligned
> >>
> >> With those out-of-tree patches, we can hit a data corruption:
> >>
> >>   # mkfs.btrfs -f -s 4k $dev
> >>   # mount $dev $mnt -o compress=zstd
> >>   # xfs_io -f -c "pwrite 0 4k" $mnt/foobar
> >>   # sync
> >>   # echo 3 > /proc/sys/vm/drop_caches
> >>   # xfs_io -f -c "pwrite 8k 4k" $mnt/foobar
> >>   # md5sum $mnt/foobar
> >>   65df683add4707de8200bad14745b9ec
> >>
> >> Meanwhile such a workload should result in an md5sum of
> >> 277f3840b275c74d01e979ea9d75ac19
> >
> > So this is hard for us human beings to understand - we don't compute
> > checksums in our heads.
> > So something easier:
> >
> >   # mkfs.btrfs -f -s 4k $dev
> >   # mount $dev $mnt -o compress=zstd
> >   # xfs_io -f -c "pwrite -S 0xab 0 4k" $mnt/foobar
> >   # sync
> >   # echo 3 > /proc/sys/vm/drop_caches
> >   # xfs_io -f -c "pwrite -S 0xcd 8k 4k" $mnt/foobar
> >   # od -A d -t x1 $mnt/foobar
> >
> > Now display the result of od, which is easy to understand and
> > summarises repeated patterns.
> > It should be obvious here when the expected pattern isn't observed:
> > byte range 0 to 4095 should be full of 0xab, and 8192 to 12K full of
> > 0xcd.
> >
> > See, a lot easier for anyone to understand than comparing checksums.
>
> Thanks a lot, will use that reproducer in the commit message.
>
> >
> >>
> >> [CAUSE]
> >> The first buffered write into range [0, 4K) will result in a compressed
> >> inline extent (this relies on the patch "btrfs: allow inline data
> >> extents creation if sector size < page size" to create such an inline
> >> extent):
> >>
> >>   item 6 key (257 EXTENT_DATA 0) itemoff 15823 itemsize 40
> >>     generation 9 type 0 (inline)
> >>     inline extent data size 19 ram_bytes 4096 compression 3 (zstd)
> >>
> >> Then all page cache is dropped, and we do the new write into range
> >> [8K, 12K).
> >>
> >> With the out-of-tree patch "btrfs: allow buffered write to skip full
> >> page if it's sector aligned", such an aligned write won't trigger the
> >> full folio read, so we just dirtied range [8K, 12K), and range [0, 4K)
> >> is not uptodate.
> >
> > And without that out-of-tree patch, do we have any problem?
>
> Fortunately no.
>
> > If that patch creates a problem then perhaps fix it before it is
> > merged, or at least include it early in this patchset?
>
> I'll put this one before that patch in the merged series.
>
> >
> > See, at least for me it's confusing to have patches saying they fix a
> > problem that happens when some other unmerged patch is applied.
> > It raises the doubt whether that other patch should be fixed instead,
> > making this one not needed, or at least whether they should be in the
> > same patchset.
>
> But I'm wondering, what's the proper way to handle such cases?
>
> Is putting this one before that patch enough?
>
> I really want to express some strong dependency, as if someone
> backported that partial uptodate folio support, it would easily screw
> up a lot of things.

You mean if someone decided to apply that out-of-tree patch to their
own custom-built kernel and use it?
If so, I don't think they should be doing that; we normally only
support code that was merged into Linus' tree.

>
> On the other hand, it's also very hard to explain why this patch itself
> is needed without mentioning the future behavior change.
>
> Any good suggestions? Especially since I'm pretty sure such a pattern
> will happen again and again as we're approaching larger data folio
> support.

I think the pattern only happens because patches for sub-page block
sizes keep accumulating without being merged, as you are the only one
doing development in that area.
I don't have a perfect solution for that.

I see you've sent another patchset reordering things and making the
dependency more clear. I'll look into it.

Thanks.

>
> Thanks,
> Qu
>
> >
> > It's not clear here if this patch serves any purpose other than
> > fixing that other out-of-tree patch.
> >
> > The patch itself looks fine, but all these references to other
> > unmerged patches, and that they cause problems, make it hard and
> > confusing to review.
> > Sorry, it's just hard to follow these patchsets that depend on other
> > not yet merged patchsets :(
> >
> > Thanks.
> >
> >>
> >> Then md5sum triggered the full folio read, causing us to read the
> >> inlined data extent.
> >>
> >> Inside the functions read_inline_extent() and uncompress_inline(), we
> >> zero out the entire remaining part of the folio, including the newly
> >> dirtied range [8K, 12K), leading to the corruption.
> >>
> >> [FIX]
> >> Thankfully the bug has not yet reached any end users.
> >> For the upstream kernel, the [8K, 12K) write itself will trigger the
> >> full folio read before doing any write, thus everything is still fine.
> >>
> >> Furthermore, existing btrfs users with sector size < page size (the
> >> most common one being Asahi Linux) with inline data extents created
> >> from x86_64 are also fine, because two factors are saving us:
> >>
> >> - Inline extents are always at offset 0
> >>
> >> - Folio reads always follow the file offset order
> >>
> >> So we always read out the inline extent first, zeroing the remaining
> >> folio (which has no contents yet), then read the next sector, copying
> >> the correct content over the zeroed out part.
> >> No end users are affected, thankfully.
> >>
> >> The fix is to make both read_inline_extent() and uncompress_inline()
> >> zero out only the rest of the sector, not the whole page.
> >>
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >>  fs/btrfs/inode.c | 15 +++++++++------
> >>  1 file changed, 9 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >> index 2620c554917f..ea60123a28a2 100644
> >> --- a/fs/btrfs/inode.c
> >> +++ b/fs/btrfs/inode.c
> >> @@ -6780,6 +6780,7 @@ static noinline int uncompress_inline(struct btrfs_path *path,
> >>  {
> >>  	int ret;
> >>  	struct extent_buffer *leaf = path->nodes[0];
> >> +	const u32 sectorsize = leaf->fs_info->sectorsize;
> >>  	char *tmp;
> >>  	size_t max_size;
> >>  	unsigned long inline_size;
> >> @@ -6808,17 +6809,19 @@ static noinline int uncompress_inline(struct btrfs_path *path,
> >>  	 * cover that region here.
> >>  	 */
> >>
> >> -	if (max_size < PAGE_SIZE)
> >> -		folio_zero_range(folio, max_size, PAGE_SIZE - max_size);
> >> +	if (max_size < sectorsize)
> >> +		folio_zero_range(folio, max_size, sectorsize - max_size);
> >>  	kfree(tmp);
> >>  	return ret;
> >>  }
> >>
> >> -static int read_inline_extent(struct btrfs_path *path, struct folio *folio)
> >> +static int read_inline_extent(struct btrfs_fs_info *fs_info,
> >> +			      struct btrfs_path *path, struct folio *folio)
> >>  {
> >>  	struct btrfs_file_extent_item *fi;
> >>  	void *kaddr;
> >>  	size_t copy_size;
> >> +	const u32 sectorsize = fs_info->sectorsize;
> >>
> >>  	if (!folio || folio_test_uptodate(folio))
> >>  		return 0;
> >> @@ -6836,8 +6839,8 @@ static int read_inline_extent(struct btrfs_path *path, struct folio *folio)
> >>  	read_extent_buffer(path->nodes[0], kaddr,
> >>  			   btrfs_file_extent_inline_start(fi), copy_size);
> >>  	kunmap_local(kaddr);
> >> -	if (copy_size < PAGE_SIZE)
> >> -		folio_zero_range(folio, copy_size, PAGE_SIZE - copy_size);
> >> +	if (copy_size < sectorsize)
> >> +		folio_zero_range(folio, copy_size, sectorsize - copy_size);
> >>  	return 0;
> >>  }
> >>
> >> @@ -7012,7 +7015,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
> >>  	ASSERT(em->disk_bytenr == EXTENT_MAP_INLINE);
> >>  	ASSERT(em->len == fs_info->sectorsize);
> >>
> >> -	ret = read_inline_extent(path, folio);
> >> +	ret = read_inline_extent(fs_info, path, folio);
> >>  	if (ret < 0)
> >>  		goto out;
> >>  	goto insert;
> >> --
> >> 2.48.1
> >>
> >>
> >
>

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2 1/4] btrfs: fix inline data extent reads which zero out the remaining part 2025-02-21 22:39 ` Qu Wenruo 2025-02-24 16:47 ` Filipe Manana @ 2025-02-27 14:36 ` Daniel Vacek 2025-02-27 23:39 ` Qu Wenruo 1 sibling, 1 reply; 12+ messages in thread From: Daniel Vacek @ 2025-02-27 14:36 UTC (permalink / raw) To: Qu Wenruo; +Cc: Filipe Manana, Qu Wenruo, linux-btrfs On Fri, 21 Feb 2025 at 23:39, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > > 在 2025/2/21 23:02, Filipe Manana 写道: > > On Sat, Feb 15, 2025 at 8:34 AM Qu Wenruo <wqu@suse.com> wrote: > >> > >> [BUG in DEVEL BRANCH] > >> This bug itself can only be reproduced with the following out-of-tree > >> patches: > >> > >> btrfs: allow inline data extents creation if sector size < page size > > > > At least this patch could be part of this patchset no? It seems related. > > It makes it harder to review with unmerged dependencies. > > Sure, I can merge all those patches into a much larger series. > > > > >> btrfs: allow buffered write to skip full page if it's sector aligned > >> > >> With those out-of-tree patches, we can hit a data corruption: > >> > >> # mkfs.btrfs -f -s 4k $dev > >> # mount $dev $mnt -o compress=zstd > >> # xfs_io -f -c "pwrite 0 4k" $mnt/foobar > >> # sync > >> # echo 3 > /proc/sys/vm/drop_caches > >> # xfs_io -f -c" pwrite 8k 4k" $mnt/foobar > >> # md5sum $mnt/foobar > >> 65df683add4707de8200bad14745b9ec > >> > >> Meanwhile such workload should result a md5sum of > >> 277f3840b275c74d01e979ea9d75ac19 > > > > So this is hard for us human beings to understand - we don't compute > > checksums in our heads. 
> > So something easier: > > > > # mkfs.btrfs -f -s 4k $dev > > # mount $dev $mnt -o compress=zstd > > # xfs_io -f -c "pwrite -S 0xab 0 4k" $mnt/foobar > > # sync > > # echo 3 > /proc/sys/vm/drop_caches > > # xfs_io -f -c "pwrite -S 0xcd 8k 4k" $mnt/foobar > > # od -A d -t x1 $mnt/foobar > > > > Now display the result of od which is easy to understand and > > summarises repeated patterns. > > It should be obvious here that the expected pattern isn't observed, > > bytes range 0 to 4095 full of 0xab and 8192 to 12K full of 0xcd. > > > > See, a lot easier for anyone to understand rather than comparing checksums. > > Thanks a lot, will go that reproducer in the commit message. > > > > > >> > >> [CAUSE] > >> The first buffered write into range [0, 4k) will result a compressed > >> inline extent (relies on the patch "btrfs: allow inline data extents > >> creation if sector size < page size" to create such inline extent): > >> > >> item 6 key (257 EXTENT_DATA 0) itemoff 15823 itemsize 40 > >> generation 9 type 0 (inline) > >> inline extent data size 19 ram_bytes 4096 compression 3 (zstd) > >> > >> Then all page cache is dropped, and we do the new write into range > >> [8K, 12K) > >> > >> With the out-of-tree patch "btrfs: allow buffered write to skip full page if > >> it's sector aligned", such aligned write won't trigger the full folio > >> read, so we just dirtied range [8K, 12K), and range [0, 4K) is not > >> uptodate. > > > > And without that out-of-tree patch, do we have any problem? > > Fortunately no. > > > If that patch creates a problem then perhaps fix it before being > > merged or at least include it early in this patchset? > > I'll put this one before that patch in the merged series. > > > > > See, at least for me it's confusing to have patches saying they fix a > > problem that happens when some other unmerged patch is applied. 
> > It raises the doubt whether that other patch should be fixed instead,
> > therefore making this one not needed, or at least whether they should
> > be in the same patchset.
>
> But I'm wondering, what's the proper way to handle such cases?
>
> Is putting this one before that patch enough?
>
> I really want to express some strong dependency, as if someone
> backported that partial uptodate folio support, it would easily screw up
> a lot of things.
>
> On the other hand, it's also very hard to explain why this patch itself
> is needed without mentioning the future behavior change.
>
> Any good suggestions? Especially since I'm pretty sure such a pattern
> will happen again and again as we're approaching larger data folio
> support.

Perhaps an fstest which fails with the later patch applied but without
this one would help?
* Re: [PATCH v2 1/4] btrfs: fix inline data extent reads which zero out the remaining part
From: Qu Wenruo @ 2025-02-27 23:39 UTC (permalink / raw)
To: Daniel Vacek; +Cc: Filipe Manana, Qu Wenruo, linux-btrfs

On 2025/2/28 01:06, Daniel Vacek wrote:
> On Fri, 21 Feb 2025 at 23:39, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
[...]
>> Any good suggestions? Especially since I'm pretty sure such a pattern
>> will happen again and again as we're approaching larger data folio
>> support.
>
> Perhaps an fstest which fails with the later patch applied but without
> this one would help?

BTW, for the backports I really mean stable backports, not someone
picking up those out-of-tree patches.

And for the fstests failure suggestion, it's already covered by both
generic fstests runs and a dedicated btrfs-specific one.

But unfortunately it's not the first nor the last time that stable
branches pick some patches without their dependencies.

So I guess what we can do is really just check the stable patch mails
and manually reject those incorrect backports, at least for now.

Thanks,
Qu
* Re: [PATCH v2 1/4] btrfs: fix inline data extent reads which zero out the remaining part
From: David Sterba @ 2025-03-06 9:24 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Daniel Vacek, Filipe Manana, Qu Wenruo, linux-btrfs

On Fri, Feb 28, 2025 at 10:09:46AM +1030, Qu Wenruo wrote:
> On 2025/2/28 01:06, Daniel Vacek wrote:
> > On Fri, 21 Feb 2025 at 23:39, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> [...]
> >> Any good suggestions? Especially since I'm pretty sure such a pattern
> >> will happen again and again as we're approaching larger data folio
> >> support.
> >
> > Perhaps an fstest which fails with the later patch applied but without
> > this one would help?
>
> BTW, for the backports I really mean stable backports, not someone
> picking up those out-of-tree patches.
>
> And for the fstests failure suggestion, it's already covered by both
> generic fstests runs and a dedicated btrfs-specific one.
>
> But unfortunately it's not the first nor the last time that stable
> branches pick some patches without their dependencies.
>
> So I guess what we can do is really just check the stable patch mails
> and manually reject those incorrect backports, at least for now.

I try to do that, as I get CCed on all of them along with the patch
authors and the mailing list. It's getting harder to evaluate patches
for old stable trees, even with the dependencies listed. If you see a
patch backport and feel it's wrong, incomplete or not necessary, please
reply. Most of them are OK because we do proper patch separation, but
cleanups and refactoring introduce subtle changes that do not transfer
across distant versions.
* [PATCH v2 2/4] btrfs: fix the qgroup data free range for inline data extents
From: Qu Wenruo @ 2025-02-15 8:34 UTC (permalink / raw)
To: linux-btrfs

Inside __cow_file_range_inline(), since the inlined data no longer takes
any data space, we need to free up the reserved space.

However the code is still using the old page size == sector size
assumption, and will not handle the subpage case well.

Thankfully it is not going to cause problems, because we have two safety
nets:

- Inline data extent creation is disabled for sector size < page size
  cases for now
  But it won't stay that way for long.

- btrfs_qgroup_free_data() will only clear ranges which are already
  reserved
  So even if we pass a range larger than what we need, it should still
  be fine, especially since there should only be one sector reserved.

But just for the sake of consistency, fix the call site to use
sectorsize instead of page size.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ea60123a28a2..4b87fafa9944 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -672,7 +672,7 @@ static noinline int __cow_file_range_inline(struct btrfs_inode *inode,
 	 * And at reserve time, it's always aligned to page size, so
 	 * just free one page here.
 	 */
-	btrfs_qgroup_free_data(inode, NULL, 0, PAGE_SIZE, NULL);
+	btrfs_qgroup_free_data(inode, NULL, 0, fs_info->sectorsize, NULL);
 	btrfs_free_path(path);
 	btrfs_end_transaction(trans);
 	return ret;
--
2.48.1
* [PATCH v2 3/4] btrfs: allow inline data extents creation if sector size < page size
From: Qu Wenruo @ 2025-02-15 8:34 UTC (permalink / raw)
To: linux-btrfs

Previously inline data extent creation was disabled if sector size <
page size, as there were two blockers:

- Possible mixed inline and regular data extents
  However mixed inline and regular data extents can also show up for
  the sector size == page size case, and we do not treat them as an
  error there. So from day one, mixed inline and regular extents were
  not a strong argument for disabling inline extents.

- Unable to handle async/inline delalloc ranges for sector size < page
  size cases
  This was the main technical blocker, and it is fixed by the recent
  sector perfect compressed write support for sector size < page size
  cases.

With the major technical blocker removed, we can enable inline data
extent creation for sector size < page size, allowing btrfs to have the
same capability no matter the page size.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4b87fafa9944..7796e81dbb9d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -566,19 +566,6 @@ static bool can_cow_file_range_inline(struct btrfs_inode *inode,
 	if (offset != 0)
 		return false;
 
-	/*
-	 * Due to the page size limit, for subpage we can only trigger the
-	 * writeback for the dirty sectors of page, that means data writeback
-	 * is doing more writeback than what we want.
-	 *
-	 * This is especially unexpected for some call sites like fallocate,
-	 * where we only increase i_size after everything is done.
-	 * This means we can trigger inline extent even if we didn't want to.
-	 * So here we skip inline extent creation completely.
-	 */
-	if (fs_info->sectorsize != PAGE_SIZE)
-		return false;
-
 	/* Inline extents are limited to sectorsize. */
 	if (size > fs_info->sectorsize)
 		return false;
--
2.48.1
* [PATCH v2 4/4] btrfs: remove the subpage related warning message
From: Qu Wenruo @ 2025-02-15 8:34 UTC (permalink / raw)
To: linux-btrfs

Since the initial enablement of block size < page size support for btrfs
in v5.15, we have hit several milestones for block size < page size
(subpage) support:

- RAID56 subpage support
  In v5.19

- Refactored scrub support to handle subpage better
  In v6.4

- Block perfect compressed write (previously a page-aligned range was
  required)
  In v6.13

- Various error handling fixes involving subpage
  In v6.14

Finally the only missing feature, the pretty simple and harmless inline
data extent creation, was just added in the previous patches.

Now that btrfs has all of its features ready for both regular and
subpage cases, there is no reason to output a warning about experimental
subpage support, so the warning can finally be removed.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8c31ba1b061e..1d64b3721e25 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3415,11 +3415,6 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	 */
 	fs_info->max_inline = min_t(u64, fs_info->max_inline, fs_info->sectorsize);
 
-	if (sectorsize < PAGE_SIZE)
-		btrfs_warn(fs_info,
-	"read-write for sector size %u with page size %lu is experimental",
-			   sectorsize, PAGE_SIZE);
-
 	ret = btrfs_init_workqueues(fs_info);
 	if (ret)
 		goto fail_sb_buffer;
--
2.48.1
* Re: [PATCH v2 0/4] btrfs: allow creating inline data extents for sector size < page size case
From: Neal Gompa @ 2025-02-15 14:42 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs

On Sat, Feb 15, 2025 at 3:34 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [CHANGELOG]
> v2:
> - Add the previous inline read fix into the series
> - Add a patch to remove the subpage experimental warning message
>   The main reason for the warning is the lack of some features, but
>   that's no longer the case.
>
> For btrfs block size < page size (subpage), there used to be a list of
> features that were not supported:
>
> - RAID56
>   Added in v5.19
>
> - Block perfect compressed write
>   Added in v6.13; previously only page-aligned ranges could go through
>   the compressed write path.
>
> - Inline data extent creation
>
> So now the only feature that is missing is inline data extent creation.
>
> And since all technical problems were solved in v6.13, it's time for
> us to allow subpage btrfs to create inline data extents.
>
> The first patch fixes a bug that can only be triggered with the recent
> partial uptodate folio support.
>
> The second patch fixes a minor issue with qgroup accounting for inlined
> data extents.
>
> The third patch enables inline data extent creation for subpage btrfs.
>
> And the final one removes the experimental warning message for subpage
> btrfs.
>
> Qu Wenruo (4):
>   btrfs: fix inline data extent reads which zero out the remaining part
>   btrfs: fix the qgroup data free range for inline data extents
>   btrfs: allow inline data extents creation if sector size < page size
>   btrfs: remove the subpage related warning message
>
>  fs/btrfs/disk-io.c |  5 -----
>  fs/btrfs/inode.c   | 30 ++++++++++--------------------
>  2 files changed, 10 insertions(+), 25 deletions(-)
>
> --
> 2.48.1

This is fantastic! :)

The series looks good to me, so...

Reviewed-by: Neal Gompa <neal@gompa.dev>

--
真実はいつも一つ! / Always, there's only one truth!