[PATCH v2 0/2] btrfs: fix data corruption/hang/rsv leak in subpage zoned cases

Linux Btrfs filesystem development
 help / color / mirror / Atom feed

* [PATCH v2 0/2] btrfs: fix data corruption/hang/rsv leak in subpage zoned cases
@ 2024-03-05 22:35 Qu Wenruo
  2024-03-05 22:35 ` [PATCH v2 1/2] btrfs: do not clear page dirty inside extent_write_locked_range() Qu Wenruo
  2024-03-05 22:35 ` [PATCH v2 2/2] btrfs: make extent_write_locked_range() to handle subpage writeback correctly Qu Wenruo
  0 siblings, 2 replies; 6+ messages in thread
From: Qu Wenruo @ 2024-03-05 22:35 UTC (permalink / raw)
  To: linux-btrfs

[CHANGELOG]
v2:
- Update the commit message for the first patch
  As there is something wrong with the ASCII art of the memory layout.

[REPO]
https://github.com/adam900710/linux/tree/subpage_delalloc

The repo includes the subpage delalloc rework, or subpage zoned won't
work at all.

Although my previous subpage delalloc rework fixes quite a lot of
crashes of subpage + zoned support, it's still pretty easy to cause rsv
leak using single thread fsstress.

It turns out that, it's not a simple problem of some rsv leak, but
certain dirty data ranges never got written back and just skipped with
its dirty flags cleared, no wonder that would lead to rsv leak.

The root cause is again in the extent_write_locked_range() function
doing weird subpage incompatible behaviors, especially when it clears
the page dirty flag for the whole page, causing __extent_writepage_io()
unable to locate any dirty ranges to be submitted.

The first patch would solve the problem, meanwhile for the 2nd patch
it's a cleanup, as we will never hit the error for current subpage +
zoned cases.

Qu Wenruo (2):
  btrfs: do not clear page dirty inside extent_write_locked_range()
  btrfs: make extent_write_locked_range() to handle subpage writeback
    correctly

 fs/btrfs/extent_io.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

-- 
2.44.0

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v2 1/2] btrfs: do not clear page dirty inside extent_write_locked_range()
  2024-03-05 22:35 [PATCH v2 0/2] btrfs: fix data corruption/hang/rsv leak in subpage zoned cases Qu Wenruo
@ 2024-03-05 22:35 ` Qu Wenruo
  2024-05-10 14:25   ` Josef Bacik
  2024-03-05 22:35 ` [PATCH v2 2/2] btrfs: make extent_write_locked_range() to handle subpage writeback correctly Qu Wenruo
  1 sibling, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2024-03-05 22:35 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
For subpage + zoned case, btrfs can easily hang with the following
workload, even with previous subpage delalloc rework:

 # mkfs.btrfs -s 4k -f $dev
 # mount $dev $mnt
 # xfs_io -f -c "pwrite 32k 128k" $mnt/foobar
 # umount $mnt

The system would hang at unmount due to unfinished ordered extents.

Above $dev is a tcmu-runner emulated zoned HDD, which has a max zone
append size of 64K, and the system has 64K page size.

[CAUSE]
There is a bug involved in extent_write_locked_range() (well, I'm
already surprised by how many subpage incompatible code are inside that
function):

- If @pages_dirty is true, we will clear the page dirty flag for the
  whole page

  This means, for above case, since the max zone append size is 64K,
  we got an ordered extent sized 64K, resulting the following writeback
  range:

  0               64K                 128K            192K
  |       |///////|///////////////////|/////////|
          32K               96K
           \       OE      /

  |///| = subpage dirty range

  And when we go into the __extent_writepage_io() call to submit [32K,
  64K), extent_write_locked_range() would find it's the locked page, and
  not clear its page dirty flag, so the submission go without any
  problem.

  But when we go into the [64K, 96K) range for the second half of the
  ordered extent, extent_write_locked_range() would clear the page dirty
  flag for the whole page range [64K, 128K), resulting the following
  layout:

  0               64K                 128K            192K
  |       |///////|         |         |/////////|
          32K               96K
           \       OE      /

  Then inside __extent_writepage_io(), since the page is no longer
  dirty, we skip the submission, causing only half of the ordered extent
  can be finished, thus hanging the whole system.

  Furthermore, this would cause more problems when we move to the next
  delalloc range [96K, 160K), as the original dirty range [96K, 128K)
  has its dirty flag cleared without releasing its data/metadata rsv, we
  would got rsv leak.

This bug only affects subpage and zoned case.
For non-subpage and zoned case, find_next_dirty_byte() just return the
whole page no matter if it has dirty flags or not.

For subpage and non-zoned case, we never go into
extent_write_locked_range().

[FIX]
Just do not clear the page dirty at all.
As __extent_writepage_io() would do a more accurate, subpage compatible
clear for page dirty anyway.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fb63055f42f3..bdd0e29ba848 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2290,10 +2290,8 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,

 		page = find_get_page(mapping, cur >> PAGE_SHIFT);
 		ASSERT(PageLocked(page));
-		if (pages_dirty && page != locked_page) {
+		if (pages_dirty && page != locked_page)
 			ASSERT(PageDirty(page));
-			clear_page_dirty_for_io(page);
-		}

 		ret = __extent_writepage_io(BTRFS_I(inode), page, cur, cur_len,
 					    &bio_ctrl, i_size, &nr);
-- 
2.44.0

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v2 2/2] btrfs: make extent_write_locked_range() to handle subpage writeback correctly
  2024-03-05 22:35 [PATCH v2 0/2] btrfs: fix data corruption/hang/rsv leak in subpage zoned cases Qu Wenruo
  2024-03-05 22:35 ` [PATCH v2 1/2] btrfs: do not clear page dirty inside extent_write_locked_range() Qu Wenruo
@ 2024-03-05 22:35 ` Qu Wenruo
  2024-05-10 14:25   ` Josef Bacik
  1 sibling, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2024-03-05 22:35 UTC (permalink / raw)
  To: linux-btrfs

When extent_write_locked_range() generated an inline extent, it would
set and finish the writeback for the whole page.

Although currently it's safe since subpage disables inline creation,
for the sake of consistency, let it go with subpage helpers to set and
clear the writeback flags.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bdd0e29ba848..0a194dd659e7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2286,6 +2286,7 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
 		u64 cur_end = min(round_down(cur, PAGE_SIZE) + PAGE_SIZE - 1, end);
 		u32 cur_len = cur_end + 1 - cur;
 		struct page *page;
+		struct folio *folio;
 		int nr = 0;
 
 		page = find_get_page(mapping, cur >> PAGE_SHIFT);
@@ -2300,8 +2301,9 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
 
 		/* Make sure the mapping tag for page dirty gets cleared. */
 		if (nr == 0) {
-			set_page_writeback(page);
-			end_page_writeback(page);
+			folio = page_folio(page);
+			btrfs_folio_set_writeback(fs_info, folio, cur, cur_len);
+			btrfs_folio_clear_writeback(fs_info, folio, cur, cur_len);
 		}
 		if (ret) {
 			btrfs_mark_ordered_io_finished(BTRFS_I(inode), page,
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 1/2] btrfs: do not clear page dirty inside extent_write_locked_range()
  2024-03-05 22:35 ` [PATCH v2 1/2] btrfs: do not clear page dirty inside extent_write_locked_range() Qu Wenruo
@ 2024-05-10 14:25   ` Josef Bacik
  2024-05-12  8:56     ` Qu Wenruo
  0 siblings, 1 reply; 6+ messages in thread
From: Josef Bacik @ 2024-05-10 14:25 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Mar 06, 2024 at 09:05:33AM +1030, Qu Wenruo wrote:
> [BUG]
> For subpage + zoned case, btrfs can easily hang with the following
> workload, even with previous subpage delalloc rework:
> 
>  # mkfs.btrfs -s 4k -f $dev
>  # mount $dev $mnt
>  # xfs_io -f -c "pwrite 32k 128k" $mnt/foobar
>  # umount $mnt
> 
> The system would hang at unmount due to unfinished ordered extents.
> 
> Above $dev is a tcmu-runner emulated zoned HDD, which has a max zone
> append size of 64K, and the system has 64K page size.
> 

This whole explanation confuses me, because in the normal case we clear the
whole page dirty before we enter into __extent_writepage_io(), so what we're
doing here is no different than the regular case, so I don't get how this is a
problem?

> [CAUSE]
> There is a bug involved in extent_write_locked_range() (well, I'm
> already surprised by how many subpage incompatible code are inside that
> function):
> 
> - If @pages_dirty is true, we will clear the page dirty flag for the
>   whole page
> 
>   This means, for above case, since the max zone append size is 64K,
>   we got an ordered extent sized 64K, resulting the following writeback
>   range:
> 
>   0               64K                 128K            192K
>   |       |///////|///////////////////|/////////|
>           32K               96K
>            \       OE      /
> 
>   |///| = subpage dirty range
> 
>   And when we go into the __extent_writepage_io() call to submit [32K,
>   64K), extent_write_locked_range() would find it's the locked page, and
>   not clear its page dirty flag, so the submission go without any
>   problem.
> 
>   But when we go into the [64K, 96K) range for the second half of the
>   ordered extent, extent_write_locked_range() would clear the page dirty
>   flag for the whole page range [64K, 128K), resulting the following
>   layout:
> 
>   0               64K                 128K            192K
>   |       |///////|         |         |/////////|
>           32K               96K
>            \       OE      /

Huh?  We come into extent_write_locked_range(), the first page is the locked
page so we don't clear dirty, because it's already been cleared dirty by the
original start from extent_write_cache_pages() right?  So we would have

Actual page
[0, 64k) !PageDirty()
[64k, 128k) PageDirty()

Subpage
[0, 64k) not dirty
[64k, 160k) dirty

> 
>   Then inside __extent_writepage_io(), since the page is no longer
>   dirty, we skip the submission, causing only half of the ordered extent
>   can be finished, thus hanging the whole system.

Huh? We _always_ call __extent_writepage_io() with the actual page not dirty, so
how are we skipping the submission?  We only clear the subpage range for the
part we actually submitted.

This needs a better explanation, because I'm super confused about how the page
state matters here when the main path does exactly this.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 2/2] btrfs: make extent_write_locked_range() to handle subpage writeback correctly
  2024-03-05 22:35 ` [PATCH v2 2/2] btrfs: make extent_write_locked_range() to handle subpage writeback correctly Qu Wenruo
@ 2024-05-10 14:25   ` Josef Bacik
  0 siblings, 0 replies; 6+ messages in thread
From: Josef Bacik @ 2024-05-10 14:25 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Mar 06, 2024 at 09:05:34AM +1030, Qu Wenruo wrote:
> When extent_write_locked_range() generated an inline extent, it would
> set and finish the writeback for the whole page.
> 
> Although currently it's safe since subpage disables inline creation,
> for the sake of consistency, let it go with subpage helpers to set and
> clear the writeback flags.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 1/2] btrfs: do not clear page dirty inside extent_write_locked_range()
  2024-05-10 14:25   ` Josef Bacik
@ 2024-05-12  8:56     ` Qu Wenruo
  0 siblings, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2024-05-12  8:56 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo; +Cc: linux-btrfs



在 2024/5/10 23:55, Josef Bacik 写道:
> On Wed, Mar 06, 2024 at 09:05:33AM +1030, Qu Wenruo wrote:
>> [BUG]
>> For subpage + zoned case, btrfs can easily hang with the following
>> workload, even with previous subpage delalloc rework:
>>
>>   # mkfs.btrfs -s 4k -f $dev
>>   # mount $dev $mnt
>>   # xfs_io -f -c "pwrite 32k 128k" $mnt/foobar
>>   # umount $mnt
>>
>> The system would hang at unmount due to unfinished ordered extents.
>>
>> Above $dev is a tcmu-runner emulated zoned HDD, which has a max zone
>> append size of 64K, and the system has 64K page size.
>>
>
> This whole explanation confuses me, because in the normal case we clear the
> whole page dirty before we enter into __extent_writepage_io(), so what we're
> doing here is no different than the regular case, so I don't get how this is a
> problem?

Although I have sent out the updated (in fact just my github branch
version) version, I still like to explain it a little more, as there are
several more hidden (existing) problems here.

The example I would go with would be the same one page layout, with 64K
page size and 4K sector size:

     0      32K      64K     96K    128K    160K    192K
     I      |////////I//////////////I///////|       I

Where "I" is the page boundary.

Firstly, the biggest confusion comes from where we clear the PageDirty.
We do it in several locations:

- folio_clear_dirty_for_io() inside extent_write_cache_pages()
   I'd say, this is the worst location to call.
   It is not subpage aware, nor at the right timing.

   The only good news is, it's not causing real problems, thus I'll
   clean it up later.

- btrfs_folio_clear_dirty() inside __extent_writepage_io()
   This is subpage aware, and timing correct after we have set the
   (sub)page range writeback.

- clear_page_dirty_for_io() inside extent_write_locked_range()
   The offending part we're discussing here.


For the "normal" subpage case, we go the following sequence:

1. folio_clear_dirty_for_io() on page [0, 64K)
    As I said already, this is not subpage aware nor timing correctly.

2. writepage_delalloc() to allocate an OE for range [32K, 160K)

3. __extent_writepage_io() for page [0, 64K)
    Which would then search subpage bits (not yet cleared), and submit
    range [32K, 64K) correctly.
    And correctly clear the subpage dirty flags.

4. folio_clear_dirty_for_io() on page [64K, 128K)

5. __extent_writepage_io() for page [64K, 128K)
    Now this time it would submit [64K, 128K) for the range.
    And correctly clear the subpage dirty flags.

6. folio_clear_dirty_for_io() on page [128K, 160K)

7. __extent_writepage_io() for page [128K, 160K)
    Now this time it would submit [128K, 160K) for the range.
    And correctly clear the subpage dirty flags.

The point here is, __extent_writepage_io() is only called for the exact
page from extent_write_cache_pages(). It would never be called on other
pages.

But for the subpage + zoned case, things can go a little different:

1. folio_clear_dirty_for_io() on page [0, 64K)
    This is still the same.

2. writepage_delalloc() to allocate an OE for range [32K, 96K)
    As our zoned append size limit is 64K.

3. __extent_writepage_io() for page [0, 64K)
    So far still no difference, submitted range [32K, 64K)

4. clear_page_dirty_for_io() for page [64K, 128K) inside
    extent_write_locked_range().
    This is different now. We are handling page other than the locked
    page.
    This would only cause problem later.

5. __extent_writepage_io() for page [64K, 128K)
    Now we only submit the range [64K, 96K), as our range is limited by
    the extent_write_locked_range().

    At this page, page dirty is cleared, but subpage dirty is only
    cleared for [64K, 96K), the remaining half still has subpage dirty
    flags. AKA, a desync between page flags and subpage flags.

6. folio_test_dirty() for page [64K, 128K) from
    extent_write_cache_pages()
    Now since that page no longer has dirty flags (even it still has
    subpage flags), extent_write_cache_pages() would simply skip this
    page, causing the range [96K, 128K) not submitted.

That's why this patch is required to fix the problem.

I hope my newer patch would be enough to explain it already, as it
explicitly showed the step 6) using a lot of trace_printk().

And if you find the explanation here is better, I'm pretty happy to
merge the explanation into the v4 patchset.

And thanks to your questions, I'm more confident to do further cleanup
on the page dirty handling.

Thanks,
Qu

>
>> [CAUSE]
>> There is a bug involved in extent_write_locked_range() (well, I'm
>> already surprised by how many subpage incompatible code are inside that
>> function):
>>
>> - If @pages_dirty is true, we will clear the page dirty flag for the
>>    whole page
>>
>>    This means, for above case, since the max zone append size is 64K,
>>    we got an ordered extent sized 64K, resulting the following writeback
>>    range:
>>
>>    0               64K                 128K            192K
>>    |       |///////|///////////////////|/////////|
>>            32K               96K
>>             \       OE      /
>>
>>    |///| = subpage dirty range
>>
>>    And when we go into the __extent_writepage_io() call to submit [32K,
>>    64K), extent_write_locked_range() would find it's the locked page, and
>>    not clear its page dirty flag, so the submission go without any
>>    problem.
>>
>>    But when we go into the [64K, 96K) range for the second half of the
>>    ordered extent, extent_write_locked_range() would clear the page dirty
>>    flag for the whole page range [64K, 128K), resulting the following
>>    layout:
>>
>>    0               64K                 128K            192K
>>    |       |///////|         |         |/////////|
>>            32K               96K
>>             \       OE      /
>
> Huh?  We come into extent_write_locked_range(), the first page is the locked
> page so we don't clear dirty, because it's already been cleared dirty by the
> original start from extent_write_cache_pages() right?  So we would have
>
> Actual page
> [0, 64k) !PageDirty()
> [64k, 128k) PageDirty()
>
> Subpage
> [0, 64k) not dirty
> [64k, 160k) dirty
>
>>
>>    Then inside __extent_writepage_io(), since the page is no longer
>>    dirty, we skip the submission, causing only half of the ordered extent
>>    can be finished, thus hanging the whole system.
>
> Huh? We _always_ call __extent_writepage_io() with the actual page not dirty, so
> how are we skipping the submission?  We only clear the subpage range for the
> part we actually submitted.
>
> This needs a better explanation, because I'm super confused about how the page
> state matters here when the main path does exactly this.  Thanks,
>
> Josef
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-05-12  8:56 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-03-05 22:35 [PATCH v2 0/2] btrfs: fix data corruption/hang/rsv leak in subpage zoned cases Qu Wenruo
2024-03-05 22:35 ` [PATCH v2 1/2] btrfs: do not clear page dirty inside extent_write_locked_range() Qu Wenruo
2024-05-10 14:25   ` Josef Bacik
2024-05-12  8:56     ` Qu Wenruo
2024-03-05 22:35 ` [PATCH v2 2/2] btrfs: make extent_write_locked_range() to handle subpage writeback correctly Qu Wenruo
2024-05-10 14:25   ` Josef Bacik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox