* [PATCH v3 0/2] btrfs: fix data corruption/rsv leak in subpage zoned cases
@ 2024-03-07 3:16 Qu Wenruo
2024-03-07 3:16 ` [PATCH v3 1/2] btrfs: do not clear page dirty inside extent_write_locked_range() Qu Wenruo
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: Qu Wenruo @ 2024-03-07 3:16 UTC (permalink / raw)
To: linux-btrfs
[CHANGELOG]
v3:
- Use the minimal fsstress workload with trace_printk() output to
explain the bug better
v2:
- Update the commit message for the first patch
As there is something wrong with the ASCII art of the memory layout.
[REPO]
https://github.com/adam900710/linux/tree/subpage_delalloc
The repo includes the subpage delalloc rework, or subpage zoned won't
work at all.
Although my previous subpage delalloc rework fixes quite a lot of
crashes of subpage + zoned support, it's still pretty easy to cause rsv
leak using single thread fsstress.
It turns out that, it's not a simple problem of some rsv leak, but
certain dirty data ranges never got written back and just skipped with
its dirty flags cleared, no wonder that would lead to rsv leak.
The root cause is again in the extent_write_locked_range() function
doing weird subpage incompatible behaviors, especially when it clears
the page dirty flag for the whole page, causing __extent_writepage_io()
unable to locate further dirty ranges to be submitted.
The first patch would solve the problem, meanwhile for the 2nd patch
it's a cleanup, as we will never hit the error for current subpage +
zoned cases.
Qu Wenruo (2):
btrfs: do not clear page dirty inside extent_write_locked_range()
btrfs: make extent_write_locked_range() to handle subpage writeback
correctly
fs/btrfs/extent_io.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
--
2.44.0
^ permalink raw reply [flat|nested] 4+ messages in thread
* [PATCH v3 1/2] btrfs: do not clear page dirty inside extent_write_locked_range()
2024-03-07 3:16 [PATCH v3 0/2] btrfs: fix data corruption/rsv leak in subpage zoned cases Qu Wenruo
@ 2024-03-07 3:16 ` Qu Wenruo
2024-03-07 3:16 ` [PATCH v3 2/2] btrfs: make extent_write_locked_range() to handle subpage writeback correctly Qu Wenruo
2024-05-10 23:04 ` [PATCH v3 0/2] btrfs: fix data corruption/rsv leak in subpage zoned cases Qu Wenruo
2 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2024-03-07 3:16 UTC (permalink / raw)
To: linux-btrfs
[BUG]
For subpage + zoned case, the following workload can lead to rsv data
leak at unmount time:
# mkfs.btrfs -f -s 4k $dev
# mount $dev $mnt
# fsstress -w -n 8 -d $mnt -s 1709539240
0/0: fiemap - no filename
0/1: copyrange read - no filename
0/2: write - no filename
0/3: rename - no source filename
0/4: creat f0 x:0 0 0
0/4: creat add id=0,parent=-1
0/5: writev f0[259 1 0 0 0 0] [778052,113,965] 0
0/6: ioctl(FIEMAP) f0[259 1 0 0 224 887097] [1294220,2291618343991484791,0x10000] -1
0/7: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 224 887097] return 25, fallback to stat()
0/7: dwrite f0[259 1 0 0 224 887097] [696320,102400] 0
# umount $mnt
The dmesg would include the following rsv leak detection wanring (all
call trace skipped):
------------[ cut here ]------------
WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8653 btrfs_destroy_inode+0x1e0/0x200 [btrfs]
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8654 btrfs_destroy_inode+0x1a8/0x200 [btrfs]
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8660 btrfs_destroy_inode+0x1a0/0x200 [btrfs]
---[ end trace 0000000000000000 ]---
BTRFS info (device sda): last unmount of filesystem 1b4abba9-de34-4f07-9e7f-157cf12a18d6
------------[ cut here ]------------
WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs]
---[ end trace 0000000000000000 ]---
BTRFS info (device sda): space_info DATA has 268218368 free, is not full
BTRFS info (device sda): space_info total=268435456, used=204800, pinned=0, reserved=0, may_use=12288, readonly=0 zone_unusable=0
BTRFS info (device sda): global_block_rsv: size 0 reserved 0
BTRFS info (device sda): trans_block_rsv: size 0 reserved 0
BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0
------------[ cut here ]------------
WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs]
---[ end trace 0000000000000000 ]---
BTRFS info (device sda): space_info METADATA has 267796480 free, is not full
BTRFS info (device sda): space_info total=268435456, used=131072, pinned=0, reserved=0, may_use=262144, readonly=0 zone_unusable=245760
BTRFS info (device sda): global_block_rsv: size 0 reserved 0
BTRFS info (device sda): trans_block_rsv: size 0 reserved 0
BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0
Above $dev is a tcmu-runner emulated zoned HDD, which has a max zone
append size of 64K, and the system has 64K page size.
[CAUSE]
I have added several trace_printk() to show the events (header skipped):
> btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688
> btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288
> btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536
> btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864
The above lines shows our buffered write has dirtied 3 pages of inode
259 of root 5:
704K 768K 832K 896K
| |////|/////////////////|///////////| |
756K 868K
|///| is the dirtied range.
> btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400
Then direct IO write starts, since the range [680K, 780K) covers the
beginning part of the above dirty range, btrfs needs to writeback the
two pages at 704K and 768K.
> cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536
> extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536
Now the above 2 lines shows that, we're writing back for dirty range
[756K, 756K + 64K).
We only writeback 64K because the zoned device has max zone append size
as 64K.
> extent_write_locked_range: r/i=5/259 clear dirty for page=786432
!!! The above line shows the root cause. !!!
We're calling clear_page_dirty_for_io() inside extent_write_locked_range(),
for the page 768K.
After the writeback of range [756K, 820K), the dirty flags looks like
this:
704K 768K 832K 896K
| | | |/////////////| |
820K 868K
Note the range inside page 768K, we should still have dirty range [820K,
832K).
This means we will no longer writeback range [820K, 832K), thus the
reserved data/metadata space would never be properly released.
> extent_write_cache_pages: r/i=5/259 skip non-dirty folio=786432
Now even we try to start wrteiback for range [820K, 832K), since the
page is not dirty, we completely skip it.
> btrfs_direct_write: r/i=5/259 dio done filepos=696320 len=0
Now the direct IO finished.
> cow_file_range: r/i=5/259 add ordered extent filepos=851968 len=36864
> extent_write_locked_range: r/i=5/259 locked page=851968 start=851968 len=36864
Now we writeback the remaining dirty range, which is [832K, 868K).
This bug only affects subpage and zoned case.
For non-subpage and zoned case, find_next_dirty_byte() just return the
whole page no matter if it has dirty flags or not.
For subpage and non-zoned case, we never go into
extent_write_locked_range().
[FIX]
Just do not clear the page dirty at all.
As __extent_writepage_io() would do a more accurate, subpage compatible
clear for page dirty anyway.
Now the correct trace would look like this:
> btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688
> btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288
> btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536
> btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864
The page dirty part is still the same 3 pages.
> btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400
> cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536
> extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536
And the writeback for the first 64K is still correct.
> cow_file_range: r/i=5/259 add ordered extent filepos=839680 len=49152
> extent_write_locked_range: r/i=5/259 locked page=786432 start=839680 len=49152
Now with the fix, we can properly writeback the range [820K, 832K), and
properly release the reserved data/metadata space.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fb63055f42f3..bdd0e29ba848 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2290,10 +2290,8 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
page = find_get_page(mapping, cur >> PAGE_SHIFT);
ASSERT(PageLocked(page));
- if (pages_dirty && page != locked_page) {
+ if (pages_dirty && page != locked_page)
ASSERT(PageDirty(page));
- clear_page_dirty_for_io(page);
- }
ret = __extent_writepage_io(BTRFS_I(inode), page, cur, cur_len,
&bio_ctrl, i_size, &nr);
--
2.44.0
^ permalink raw reply related [flat|nested] 4+ messages in thread
* [PATCH v3 2/2] btrfs: make extent_write_locked_range() to handle subpage writeback correctly
2024-03-07 3:16 [PATCH v3 0/2] btrfs: fix data corruption/rsv leak in subpage zoned cases Qu Wenruo
2024-03-07 3:16 ` [PATCH v3 1/2] btrfs: do not clear page dirty inside extent_write_locked_range() Qu Wenruo
@ 2024-03-07 3:16 ` Qu Wenruo
2024-05-10 23:04 ` [PATCH v3 0/2] btrfs: fix data corruption/rsv leak in subpage zoned cases Qu Wenruo
2 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2024-03-07 3:16 UTC (permalink / raw)
To: linux-btrfs
When extent_write_locked_range() generated an inline extent, it would
set and finish the writeback for the whole page.
Although currently it's safe since subpage disables inline creation,
for the sake of consistency, let it go with subpage helpers to set and
clear the writeback flags.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bdd0e29ba848..0a194dd659e7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2286,6 +2286,7 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
u64 cur_end = min(round_down(cur, PAGE_SIZE) + PAGE_SIZE - 1, end);
u32 cur_len = cur_end + 1 - cur;
struct page *page;
+ struct folio *folio;
int nr = 0;
page = find_get_page(mapping, cur >> PAGE_SHIFT);
@@ -2300,8 +2301,9 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
/* Make sure the mapping tag for page dirty gets cleared. */
if (nr == 0) {
- set_page_writeback(page);
- end_page_writeback(page);
+ folio = page_folio(page);
+ btrfs_folio_set_writeback(fs_info, folio, cur, cur_len);
+ btrfs_folio_clear_writeback(fs_info, folio, cur, cur_len);
}
if (ret) {
btrfs_mark_ordered_io_finished(BTRFS_I(inode), page,
--
2.44.0
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH v3 0/2] btrfs: fix data corruption/rsv leak in subpage zoned cases
2024-03-07 3:16 [PATCH v3 0/2] btrfs: fix data corruption/rsv leak in subpage zoned cases Qu Wenruo
2024-03-07 3:16 ` [PATCH v3 1/2] btrfs: do not clear page dirty inside extent_write_locked_range() Qu Wenruo
2024-03-07 3:16 ` [PATCH v3 2/2] btrfs: make extent_write_locked_range() to handle subpage writeback correctly Qu Wenruo
@ 2024-05-10 23:04 ` Qu Wenruo
2 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2024-05-10 23:04 UTC (permalink / raw)
To: linux-btrfs
在 2024/3/7 13:46, Qu Wenruo 写道:
> [CHANGELOG]
> v3:
> - Use the minimal fsstress workload with trace_printk() output to
> explain the bug better
>
> v2:
> - Update the commit message for the first patch
> As there is something wrong with the ASCII art of the memory layout.
>
> [REPO]
> https://github.com/adam900710/linux/tree/subpage_delalloc
Wait, it looks like there are more preparation patches not yet merged.
I'll rebase with all needed patches in one go.
Otherwise it looks like it's causing a lot of more problems reviewing.
Thanks,
Qu
>
> The repo includes the subpage delalloc rework, or subpage zoned won't
> work at all.
>
>
> Although my previous subpage delalloc rework fixes quite a lot of
> crashes of subpage + zoned support, it's still pretty easy to cause rsv
> leak using single thread fsstress.
>
> It turns out that, it's not a simple problem of some rsv leak, but
> certain dirty data ranges never got written back and just skipped with
> its dirty flags cleared, no wonder that would lead to rsv leak.
>
> The root cause is again in the extent_write_locked_range() function
> doing weird subpage incompatible behaviors, especially when it clears
> the page dirty flag for the whole page, causing __extent_writepage_io()
> unable to locate further dirty ranges to be submitted.
>
> The first patch would solve the problem, meanwhile for the 2nd patch
> it's a cleanup, as we will never hit the error for current subpage +
> zoned cases.
>
> Qu Wenruo (2):
> btrfs: do not clear page dirty inside extent_write_locked_range()
> btrfs: make extent_write_locked_range() to handle subpage writeback
> correctly
>
> fs/btrfs/extent_io.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2024-05-10 23:04 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-03-07 3:16 [PATCH v3 0/2] btrfs: fix data corruption/rsv leak in subpage zoned cases Qu Wenruo
2024-03-07 3:16 ` [PATCH v3 1/2] btrfs: do not clear page dirty inside extent_write_locked_range() Qu Wenruo
2024-03-07 3:16 ` [PATCH v3 2/2] btrfs: make extent_write_locked_range() to handle subpage writeback correctly Qu Wenruo
2024-05-10 23:04 ` [PATCH v3 0/2] btrfs: fix data corruption/rsv leak in subpage zoned cases Qu Wenruo
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox