[PATCH] btrfs: use FGP_STABLE to wait for folio writeback

Linux Btrfs filesystem development
 help / color / mirror / Atom feed

* [PATCH] btrfs: use FGP_STABLE to wait for folio writeback
@ 2024-10-14  4:36 Qu Wenruo
  2024-10-14 11:19 ` Anand Jain
  2024-10-14 14:16 ` David Sterba
  0 siblings, 2 replies; 9+ messages in thread
From: Qu Wenruo @ 2024-10-14  4:36 UTC (permalink / raw)
  To: linux-btrfs

__filemap_get_folio() provides the flag FGP_STABLE to wait for
writeback.

There are two call sites doing __filemap_get_folio() then
folio_wait_writeback():

- btrfs_truncate_block()
- defrag_prepare_one_folio()

We can directly utilize that flag instead of manually calling
folio_wait_writeback().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/defrag.c | 4 +---
 fs/btrfs/inode.c  | 4 +---
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index b95ef44c326b..1644470b9df7 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -865,7 +865,7 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
 
 again:
 	folio = __filemap_get_folio(mapping, index,
-				    FGP_LOCK | FGP_ACCESSED | FGP_CREAT, mask);
+			FGP_LOCK | FGP_ACCESSED | FGP_CREAT | FGP_STABLE, mask);
 	if (IS_ERR(folio))
 		return folio;
 
@@ -1222,8 +1222,6 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
 			goto free_folios;
 		}
 	}
-	for (i = 0; i < nr_pages; i++)
-		folio_wait_writeback(folios[i]);
 
 	/* Lock the pages range */
 	lock_extent(&inode->io_tree, start_index << PAGE_SHIFT,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9ca74e5e7aa6..a21701571cbb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4743,7 +4743,7 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 	}
 again:
 	folio = __filemap_get_folio(mapping, index,
-				    FGP_LOCK | FGP_ACCESSED | FGP_CREAT, mask);
+			FGP_LOCK | FGP_ACCESSED | FGP_CREAT | FGP_STABLE, mask);
 	if (IS_ERR(folio)) {
 		btrfs_delalloc_release_space(inode, data_reserved, block_start,
 					     blocksize, true);
@@ -4776,8 +4776,6 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 	if (ret < 0)
 		goto out_unlock;
 
-	folio_wait_writeback(folio);
-
 	lock_extent(io_tree, block_start, block_end, &cached_state);
 
 	ordered = btrfs_lookup_ordered_extent(inode, block_start);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] btrfs: use FGP_STABLE to wait for folio writeback
  2024-10-14  4:36 [PATCH] btrfs: use FGP_STABLE to wait for folio writeback Qu Wenruo
@ 2024-10-14 11:19 ` Anand Jain
  2024-10-14 14:16 ` David Sterba
  1 sibling, 0 replies; 9+ messages in thread
From: Anand Jain @ 2024-10-14 11:19 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

LGTM

Reviewed-by: Anand Jain <anand.jain@oracle.com>

Thx


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] btrfs: use FGP_STABLE to wait for folio writeback
  2024-10-14  4:36 [PATCH] btrfs: use FGP_STABLE to wait for folio writeback Qu Wenruo
  2024-10-14 11:19 ` Anand Jain
@ 2024-10-14 14:16 ` David Sterba
  2024-10-14 20:55   ` Qu Wenruo
  1 sibling, 1 reply; 9+ messages in thread
From: David Sterba @ 2024-10-14 14:16 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Oct 14, 2024 at 03:06:31PM +1030, Qu Wenruo wrote:
> __filemap_get_folio() provides the flag FGP_STABLE to wait for
> writeback.
> 
> There are two call sites doing __filemap_get_folio() then
> folio_wait_writeback():
> 
> - btrfs_truncate_block()
> - defrag_prepare_one_folio()
> 
> We can directly utilize that flag instead of manually calling
> folio_wait_writeback().

We can do that but I'm missing a justification for that. The explicit
writeback calls are done at a different points than what FGP_STABLE
does. So what's the difference?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] btrfs: use FGP_STABLE to wait for folio writeback
  2024-10-14 14:16 ` David Sterba
@ 2024-10-14 20:55   ` Qu Wenruo
  2024-10-14 20:59     ` Qu Wenruo
  2024-10-15  0:31     ` David Sterba
  0 siblings, 2 replies; 9+ messages in thread
From: Qu Wenruo @ 2024-10-14 20:55 UTC (permalink / raw)
  To: dsterba, Qu Wenruo; +Cc: linux-btrfs



在 2024/10/15 00:46, David Sterba 写道:
> On Mon, Oct 14, 2024 at 03:06:31PM +1030, Qu Wenruo wrote:
>> __filemap_get_folio() provides the flag FGP_STABLE to wait for
>> writeback.
>>
>> There are two call sites doing __filemap_get_folio() then
>> folio_wait_writeback():
>>
>> - btrfs_truncate_block()
>> - defrag_prepare_one_folio()
>>
>> We can directly utilize that flag instead of manually calling
>> folio_wait_writeback().
>
> We can do that but I'm missing a justification for that. The explicit
> writeback calls are done at a different points than what FGP_STABLE
> does. So what's the difference?
>

TL;DR, it's not safe to read folio before waiting for writeback in theory.

There is a possible race, mostly related to my previous attempt of
subpage partial uptodate support:

              Thread A          |           Thread B
-------------------------------+-----------------------------
extent_writepage_io()          |
|- submit_one_sector()         |
   |- folio_set_writeback()     |
      The folio is partial dirty|
      and uninvolved sectors are|
      not uptodate              |
                                | btrfs_truncate_block()
                                | |- btrfs_do_readpage()
                                |   |- submit_one_folio
                                |      This will read all sectors
                                |      from disk, but that writeback
                                |      sector is not yet finished

In this case, we can read out garbage from disk, since the write is not
yet finished.

This is not yet possible, because we always read out the whole page so
in that case thread B won't trigger a read.


But this already shows the way we wait for writeback is not safe.
And that's why no other filesystems manually wait for writeback after
reading the folio.

Thus I think doing FGP_STABLE is way more safer, and that's why all
other fses are doing this way instead.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] btrfs: use FGP_STABLE to wait for folio writeback
  2024-10-14 20:55   ` Qu Wenruo
@ 2024-10-14 20:59     ` Qu Wenruo
  2024-10-15  0:31     ` David Sterba
  1 sibling, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2024-10-14 20:59 UTC (permalink / raw)
  To: dsterba, Qu Wenruo; +Cc: linux-btrfs



在 2024/10/15 07:25, Qu Wenruo 写道:
> 
> 
> 在 2024/10/15 00:46, David Sterba 写道:
>> On Mon, Oct 14, 2024 at 03:06:31PM +1030, Qu Wenruo wrote:
>>> __filemap_get_folio() provides the flag FGP_STABLE to wait for
>>> writeback.
>>>
>>> There are two call sites doing __filemap_get_folio() then
>>> folio_wait_writeback():
>>>
>>> - btrfs_truncate_block()
>>> - defrag_prepare_one_folio()
>>>
>>> We can directly utilize that flag instead of manually calling
>>> folio_wait_writeback().
>>
>> We can do that but I'm missing a justification for that. The explicit
>> writeback calls are done at a different points than what FGP_STABLE
>> does. So what's the difference?
>>
> 
> TL;DR, it's not safe to read folio before waiting for writeback in theory.
> 
> There is a possible race, mostly related to my previous attempt of
> subpage partial uptodate support:
> 
>               Thread A          |           Thread B
> -------------------------------+-----------------------------
> extent_writepage_io()          |
> |- submit_one_sector()         |
>    |- folio_set_writeback()     |
>       The folio is partial dirty|
>       and uninvolved sectors are|
>       not uptodate              |
>                                 | btrfs_truncate_block()
>                                 | |- btrfs_do_readpage()
>                                 |   |- submit_one_folio
>                                 |      This will read all sectors
>                                 |      from disk, but that writeback
>                                 |      sector is not yet finished
> 
> In this case, we can read out garbage from disk, since the write is not
> yet finished.
> 
> This is not yet possible, because we always read out the whole page so
> in that case thread B won't trigger a read.

Here I mean at page dirty time, we always read out the whole page if the 
write range is not page aligned.

So we won't have partial page uptodate yet.

But the race is still here.

> 
> 
> But this already shows the way we wait for writeback is not safe.
> And that's why no other filesystems manually wait for writeback after
> reading the folio.
> 
> Thus I think doing FGP_STABLE is way more safer, and that's why all
> other fses are doing this way instead.
> 
> Thanks,
> Qu
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] btrfs: use FGP_STABLE to wait for folio writeback
  2024-10-14 20:55   ` Qu Wenruo
  2024-10-14 20:59     ` Qu Wenruo
@ 2024-10-15  0:31     ` David Sterba
  2024-10-15  1:26       ` Qu Wenruo
  1 sibling, 1 reply; 9+ messages in thread
From: David Sterba @ 2024-10-15  0:31 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On Tue, Oct 15, 2024 at 07:25:20AM +1030, Qu Wenruo wrote:
> 在 2024/10/15 00:46, David Sterba 写道:
> > On Mon, Oct 14, 2024 at 03:06:31PM +1030, Qu Wenruo wrote:
> >> __filemap_get_folio() provides the flag FGP_STABLE to wait for
> >> writeback.
> >>
> >> There are two call sites doing __filemap_get_folio() then
> >> folio_wait_writeback():
> >>
> >> - btrfs_truncate_block()
> >> - defrag_prepare_one_folio()
> >>
> >> We can directly utilize that flag instead of manually calling
> >> folio_wait_writeback().
> >
> > We can do that but I'm missing a justification for that. The explicit
> > writeback calls are done at a different points than what FGP_STABLE
> > does. So what's the difference?
> >
> 
> TL;DR, it's not safe to read folio before waiting for writeback in theory.
> 
> There is a possible race, mostly related to my previous attempt of
> subpage partial uptodate support:
> 
>               Thread A          |           Thread B
> -------------------------------+-----------------------------
> extent_writepage_io()          |
> |- submit_one_sector()         |
>    |- folio_set_writeback()     |
>       The folio is partial dirty|
>       and uninvolved sectors are|
>       not uptodate              |
>                                 | btrfs_truncate_block()
>                                 | |- btrfs_do_readpage()
>                                 |   |- submit_one_folio
>                                 |      This will read all sectors
>                                 |      from disk, but that writeback
>                                 |      sector is not yet finished
> 
> In this case, we can read out garbage from disk, since the write is not
> yet finished.
> 
> This is not yet possible, because we always read out the whole page so
> in that case thread B won't trigger a read.
> 
> But this already shows the way we wait for writeback is not safe.
> And that's why no other filesystems manually wait for writeback after
> reading the folio.
> 
> Thus I think doing FGP_STABLE is way more safer, and that's why all
> other fses are doing this way instead.

I'm not disputing we need it and I may be missing something, what I
noticed in the patch is maybe a generic pattern, structure read at some
time and then synced/written, but there could be some change in
bettween.  One example is one you show (theoretically or not).

The writeback is a kind of synchronization point, but also in parallel
with the data/metadata in page cache. If the state, regarding writeback,
is not safe and we can either get old data or could get partially synced
data (ie. ok in page cache but not regarding writeback) it is a
problematic pattern.

You found two cases, truncate and defrag. Both are different I think,
truncate comes from normal MM operations, while defrag is triggered by
an ioctl (still trying to be in sync with MM).

I'm not sure we can copy what other filesystems do, even if it's just on
the basic principle of COW vs update in place + journaling. We copy the
and do the next update and don't have to care about previous state, so
even a split between read and writeback does no harm.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] btrfs: use FGP_STABLE to wait for folio writeback
  2024-10-15  0:31     ` David Sterba
@ 2024-10-15  1:26       ` Qu Wenruo
  2024-10-15  4:04         ` Anand Jain
  0 siblings, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2024-10-15  1:26 UTC (permalink / raw)
  To: dsterba, Qu Wenruo; +Cc: linux-btrfs

在 2024/10/15 11:01, David Sterba 写道:
> On Tue, Oct 15, 2024 at 07:25:20AM +1030, Qu Wenruo wrote:
>> 在 2024/10/15 00:46, David Sterba 写道:
>>> On Mon, Oct 14, 2024 at 03:06:31PM +1030, Qu Wenruo wrote:
>>>> __filemap_get_folio() provides the flag FGP_STABLE to wait for
>>>> writeback.
>>>>
>>>> There are two call sites doing __filemap_get_folio() then
>>>> folio_wait_writeback():
>>>>
>>>> - btrfs_truncate_block()
>>>> - defrag_prepare_one_folio()
>>>>
>>>> We can directly utilize that flag instead of manually calling
>>>> folio_wait_writeback().
>>>
>>> We can do that but I'm missing a justification for that. The explicit
>>> writeback calls are done at a different points than what FGP_STABLE
>>> does. So what's the difference?
>>>
>>
>> TL;DR, it's not safe to read folio before waiting for writeback in theory.
>>
>> There is a possible race, mostly related to my previous attempt of
>> subpage partial uptodate support:
>>
>>                Thread A          |           Thread B
>> -------------------------------+-----------------------------
>> extent_writepage_io()          |
>> |- submit_one_sector()         |
>>     |- folio_set_writeback()     |
>>        The folio is partial dirty|
>>        and uninvolved sectors are|
>>        not uptodate              |
>>                                  | btrfs_truncate_block()
>>                                  | |- btrfs_do_readpage()
>>                                  |   |- submit_one_folio
>>                                  |      This will read all sectors
>>                                  |      from disk, but that writeback
>>                                  |      sector is not yet finished
>>
>> In this case, we can read out garbage from disk, since the write is not
>> yet finished.
>>
>> This is not yet possible, because we always read out the whole page so
>> in that case thread B won't trigger a read.
>>
>> But this already shows the way we wait for writeback is not safe.
>> And that's why no other filesystems manually wait for writeback after
>> reading the folio.
>>
>> Thus I think doing FGP_STABLE is way more safer, and that's why all
>> other fses are doing this way instead.
> 
> I'm not disputing we need it and I may be missing something, what I
> noticed in the patch is maybe a generic pattern, structure read at some
> time and then synced/written, but there could be some change in
> bettween.  One example is one you show (theoretically or not).
> 
> The writeback is a kind of synchronization point, but also in parallel
> with the data/metadata in page cache. If the state, regarding writeback,
> is not safe and we can either get old data or could get partially synced
> data (ie. ok in page cache but not regarding writeback) it is a
> problematic pattern.

The writeback is a sync point, but it's more like an optimization to 
reduce page lock contention.

E.g. when a page contains several blocks, and some blocks are dirty and 
being written back, but also some sectors needs to be read.
If implemented properly, the not uptodate blocks can be properly read 
meanwhile without waiting for the writeback to finish.

But from the safety point of view, I strongly prefer to wait for the 
folio writeback in such case.

Especially considering all the existing get folio + read + wait for 
writeback is from the time where we only consider sectorsize == page size.

We have enabled sectorsize < page size since 5.15, and we should have 
dropped the wrong assumption for years.

> 
> You found two cases, truncate and defrag. Both are different I think,
> truncate comes from normal MM operations, while defrag is triggered by
> an ioctl (still trying to be in sync with MM).
> 
> I'm not sure we can copy what other filesystems do, even if it's just on
> the basic principle of COW vs update in place + journaling.

I do not think COW is making any difference. All the COW handling is at 
delalloc time, meanwhile the folio get/lock/read/wait sequence is more 
common for page dirtying (regular buffered write, defrag, truncate are 
all in a similar situation).

Page cache is here just to provide file content cache, it doesn't care 
about if it's COW or NOT.

Furthermore, COW is no longer our exclusive feature, XFS has supported 
it for quite some time, and there is no special handling just for the 
page cache.
(Meanwhile XFS and ext4 has much better blocksize < page size handling 
than us for years)

And I have already explained in that case, waiting for the writeback at 
folio get time is much safer (although reduces concurrency).

Just for the data safety, I believe you need to provide a much stronger 
argument than COW vs overwrite (which is completely unrelated).

> We copy the
> and do the next update and don't have to care about previous state, so
> even a split between read and writeback does no harm.

Although I love the csum/datacow, I do not see any strong reason not to 
follow the more common (and IMHO safer) way to wait for the writeback.

I have explained the possible race, and I do not think I need to repeat 
that example again.

If you didn't fully understand why my example, please let me know where 
it's unclear that I can explain it better.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] btrfs: use FGP_STABLE to wait for folio writeback
  2024-10-15  1:26       ` Qu Wenruo
@ 2024-10-15  4:04         ` Anand Jain
  2024-10-15  4:41           ` Qu Wenruo
  0 siblings, 1 reply; 9+ messages in thread
From: Anand Jain @ 2024-10-15  4:04 UTC (permalink / raw)
  To: Qu Wenruo, dsterba, Qu Wenruo; +Cc: linux-btrfs

On 15/10/24 09:26, Qu Wenruo wrote:
> 
> 
> 在 2024/10/15 11:01, David Sterba 写道:
>> On Tue, Oct 15, 2024 at 07:25:20AM +1030, Qu Wenruo wrote:
>>> 在 2024/10/15 00:46, David Sterba 写道:
>>>> On Mon, Oct 14, 2024 at 03:06:31PM +1030, Qu Wenruo wrote:
>>>>> __filemap_get_folio() provides the flag FGP_STABLE to wait for
>>>>> writeback.
>>>>>
>>>>> There are two call sites doing __filemap_get_folio() then
>>>>> folio_wait_writeback():
>>>>>
>>>>> - btrfs_truncate_block()
>>>>> - defrag_prepare_one_folio()
>>>>>
>>>>> We can directly utilize that flag instead of manually calling
>>>>> folio_wait_writeback().
>>>>
>>>> We can do that but I'm missing a justification for that. The explicit
>>>> writeback calls are done at a different points than what FGP_STABLE
>>>> does. So what's the difference?
>>>>
>>>
>>> TL;DR, it's not safe to read folio before waiting for writeback in 
>>> theory.
>>>
>>> There is a possible race, mostly related to my previous attempt of
>>> subpage partial uptodate support:
>>>
>>>                Thread A          |           Thread B
>>> -------------------------------+-----------------------------
>>> extent_writepage_io()          |
>>> |- submit_one_sector()         |
>>>     |- folio_set_writeback()     |
>>>        The folio is partial dirty|
>>>        and uninvolved sectors are|
>>>        not uptodate              |
>>>                                  | btrfs_truncate_block()
>>>                                  | |- btrfs_do_readpage()
>>>                                  |   |- submit_one_folio
>>>                                  |      This will read all sectors
>>>                                  |      from disk, but that writeback
>>>                                  |      sector is not yet finished
>>>
>>> In this case, we can read out garbage from disk, since the write is not
>>> yet finished.
>>>
>>> This is not yet possible, because we always read out the whole page so
>>> in that case thread B won't trigger a read.
>>>
>>> But this already shows the way we wait for writeback is not safe.
>>> And that's why no other filesystems manually wait for writeback after
>>> reading the folio.
>>>
>>> Thus I think doing FGP_STABLE is way more safer, and that's why all
>>> other fses are doing this way instead.
>>
>> I'm not disputing we need it and I may be missing something, what I
>> noticed in the patch is maybe a generic pattern, structure read at some
>> time and then synced/written, but there could be some change in
>> bettween.  One example is one you show (theoretically or not).
>>
>> The writeback is a kind of synchronization point, but also in parallel
>> with the data/metadata in page cache. If the state, regarding writeback,
>> is not safe and we can either get old data or could get partially synced
>> data (ie. ok in page cache but not regarding writeback) it is a
>> problematic pattern.
> 
> The writeback is a sync point, but it's more like an optimization to 
> reduce page lock contention.
> 

The commit/ref below avoids using %FGP_STABLE (and possibly
%FGP_WRITEBEGIN) in f2fs due to a potential deadlock in the
write-back code, but I'm unsure how. The reasoning isn't clear.
The two changes in our case aren't in the write-back path,
though. On the 2nd thought, any idea if a similar deadlock would
apply to our case in your pov?


Ref:
----
   commit dfd2e81d37e1 ("f2fs: Convert f2fs_write_begin() to use a folio")

   %FGP_STABLE - Wait for the folio to be stable (finished writeback)

    #define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
----

Thanks, Anand


> E.g. when a page contains several blocks, and some blocks are dirty and 
> being written back, but also some sectors needs to be read.
> If implemented properly, the not uptodate blocks can be properly read 
> meanwhile without waiting for the writeback to finish.
> 
> But from the safety point of view, I strongly prefer to wait for the 
> folio writeback in such case.
> 
> Especially considering all the existing get folio + read + wait for 
> writeback is from the time where we only consider sectorsize == page size.
> 
> We have enabled sectorsize < page size since 5.15, and we should have 
> dropped the wrong assumption for years.
> 
>>
>> You found two cases, truncate and defrag. Both are different I think,
>> truncate comes from normal MM operations, while defrag is triggered by
>> an ioctl (still trying to be in sync with MM).
>>
>> I'm not sure we can copy what other filesystems do, even if it's just on
>> the basic principle of COW vs update in place + journaling.
> 
> I do not think COW is making any difference. All the COW handling is at 
> delalloc time, meanwhile the folio get/lock/read/wait sequence is more 
> common for page dirtying (regular buffered write, defrag, truncate are 
> all in a similar situation).
> 
> Page cache is here just to provide file content cache, it doesn't care 
> about if it's COW or NOT.
> 
> Furthermore, COW is no longer our exclusive feature, XFS has supported 
> it for quite some time, and there is no special handling just for the 
> page cache.
> (Meanwhile XFS and ext4 has much better blocksize < page size handling 
> than us for years)
> 
> 
> And I have already explained in that case, waiting for the writeback at 
> folio get time is much safer (although reduces concurrency).
> 
> Just for the data safety, I believe you need to provide a much stronger 
> argument than COW vs overwrite (which is completely unrelated).
> 
>> We copy the
>> and do the next update and don't have to care about previous state, so
>> even a split between read and writeback does no harm.
> 
> Although I love the csum/datacow, I do not see any strong reason not to 
> follow the more common (and IMHO safer) way to wait for the writeback.
> 
> I have explained the possible race, and I do not think I need to repeat 
> that example again.
> 
> If you didn't fully understand why my example, please let me know where 
> it's unclear that I can explain it better.
> 
> Thanks,
> Qu


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] btrfs: use FGP_STABLE to wait for folio writeback
  2024-10-15  4:04         ` Anand Jain
@ 2024-10-15  4:41           ` Qu Wenruo
  0 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2024-10-15  4:41 UTC (permalink / raw)
  To: Anand Jain, Qu Wenruo, dsterba; +Cc: linux-btrfs



在 2024/10/15 14:34, Anand Jain 写道:
> On 15/10/24 09:26, Qu Wenruo wrote:
>>
>>
>> 在 2024/10/15 11:01, David Sterba 写道:
>>> On Tue, Oct 15, 2024 at 07:25:20AM +1030, Qu Wenruo wrote:
>>>> 在 2024/10/15 00:46, David Sterba 写道:
>>>>> On Mon, Oct 14, 2024 at 03:06:31PM +1030, Qu Wenruo wrote:
>>>>>> __filemap_get_folio() provides the flag FGP_STABLE to wait for
>>>>>> writeback.
>>>>>>
>>>>>> There are two call sites doing __filemap_get_folio() then
>>>>>> folio_wait_writeback():
>>>>>>
>>>>>> - btrfs_truncate_block()
>>>>>> - defrag_prepare_one_folio()
>>>>>>
>>>>>> We can directly utilize that flag instead of manually calling
>>>>>> folio_wait_writeback().
>>>>>
>>>>> We can do that but I'm missing a justification for that. The explicit
>>>>> writeback calls are done at a different points than what FGP_STABLE
>>>>> does. So what's the difference?
>>>>>
>>>>
>>>> TL;DR, it's not safe to read folio before waiting for writeback in
>>>> theory.
>>>>
>>>> There is a possible race, mostly related to my previous attempt of
>>>> subpage partial uptodate support:
>>>>
>>>>                Thread A          |           Thread B
>>>> -------------------------------+-----------------------------
>>>> extent_writepage_io()          |
>>>> |- submit_one_sector()         |
>>>>     |- folio_set_writeback()     |
>>>>        The folio is partial dirty|
>>>>        and uninvolved sectors are|
>>>>        not uptodate              |
>>>>                                  | btrfs_truncate_block()
>>>>                                  | |- btrfs_do_readpage()
>>>>                                  |   |- submit_one_folio
>>>>                                  |      This will read all sectors
>>>>                                  |      from disk, but that writeback
>>>>                                  |      sector is not yet finished
>>>>
>>>> In this case, we can read out garbage from disk, since the write is not
>>>> yet finished.
>>>>
>>>> This is not yet possible, because we always read out the whole page so
>>>> in that case thread B won't trigger a read.
>>>>
>>>> But this already shows the way we wait for writeback is not safe.
>>>> And that's why no other filesystems manually wait for writeback after
>>>> reading the folio.
>>>>
>>>> Thus I think doing FGP_STABLE is way more safer, and that's why all
>>>> other fses are doing this way instead.
>>>
>>> I'm not disputing we need it and I may be missing something, what I
>>> noticed in the patch is maybe a generic pattern, structure read at some
>>> time and then synced/written, but there could be some change in
>>> bettween.  One example is one you show (theoretically or not).
>>>
>>> The writeback is a kind of synchronization point, but also in parallel
>>> with the data/metadata in page cache. If the state, regarding writeback,
>>> is not safe and we can either get old data or could get partially synced
>>> data (ie. ok in page cache but not regarding writeback) it is a
>>> problematic pattern.
>>
>> The writeback is a sync point, but it's more like an optimization to
>> reduce page lock contention.
>>
>
> The commit/ref below avoids using %FGP_STABLE (and possibly
> %FGP_WRITEBEGIN) in f2fs due to a potential deadlock in the
> write-back code, but I'm unsure how. The reasoning isn't clear.
> The two changes in our case aren't in the write-back path,
> though. On the 2nd thought, any idea if a similar deadlock would
> apply to our case in your pov?

Not an expert on f2fs, but from the function
f2fs_wait_one_page_writeback() it will do extra bio submission and merge.

So my current guess is, there may be some page marked writeback, but not
  yet submitted related to F2FS specific behaviors.

If that's the case, f2fs will cause deadlock where it didn't submit but
wait for those writeback, causing the deadlock.

Thankfully for btrfs we do not have such cases thus it doesn't apply to us.

Thanks,
Qu
>
>
> Ref:
> ----
>    commit dfd2e81d37e1 ("f2fs: Convert f2fs_write_begin() to use a folio")
>
>    %FGP_STABLE - Wait for the folio to be stable (finished writeback)
>
>     #define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
> ----
>
> Thanks, Anand
>
>
>> E.g. when a page contains several blocks, and some blocks are dirty
>> and being written back, but also some sectors needs to be read.
>> If implemented properly, the not uptodate blocks can be properly read
>> meanwhile without waiting for the writeback to finish.
>>
>> But from the safety point of view, I strongly prefer to wait for the
>> folio writeback in such case.
>>
>> Especially considering all the existing get folio + read + wait for
>> writeback is from the time where we only consider sectorsize == page
>> size.
>>
>> We have enabled sectorsize < page size since 5.15, and we should have
>> dropped the wrong assumption for years.
>>
>>>
>>> You found two cases, truncate and defrag. Both are different I think,
>>> truncate comes from normal MM operations, while defrag is triggered by
>>> an ioctl (still trying to be in sync with MM).
>>>
>>> I'm not sure we can copy what other filesystems do, even if it's just on
>>> the basic principle of COW vs update in place + journaling.
>>
>> I do not think COW is making any difference. All the COW handling is
>> at delalloc time, meanwhile the folio get/lock/read/wait sequence is
>> more common for page dirtying (regular buffered write, defrag,
>> truncate are all in a similar situation).
>>
>> Page cache is here just to provide file content cache, it doesn't care
>> about if it's COW or NOT.
>>
>> Furthermore, COW is no longer our exclusive feature, XFS has supported
>> it for quite some time, and there is no special handling just for the
>> page cache.
>> (Meanwhile XFS and ext4 has much better blocksize < page size handling
>> than us for years)
>>
>>
>> And I have already explained in that case, waiting for the writeback
>> at folio get time is much safer (although reduces concurrency).
>>
>> Just for the data safety, I believe you need to provide a much
>> stronger argument than COW vs overwrite (which is completely unrelated).
>>
>>> We copy the
>>> and do the next update and don't have to care about previous state, so
>>> even a split between read and writeback does no harm.
>>
>> Although I love the csum/datacow, I do not see any strong reason not
>> to follow the more common (and IMHO safer) way to wait for the writeback.
>>
>> I have explained the possible race, and I do not think I need to
>> repeat that example again.
>>
>> If you didn't fully understand why my example, please let me know
>> where it's unclear that I can explain it better.
>>
>> Thanks,
>> Qu
>


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2024-10-15  4:41 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-14  4:36 [PATCH] btrfs: use FGP_STABLE to wait for folio writeback Qu Wenruo
2024-10-14 11:19 ` Anand Jain
2024-10-14 14:16 ` David Sterba
2024-10-14 20:55   ` Qu Wenruo
2024-10-14 20:59     ` Qu Wenruo
2024-10-15  0:31     ` David Sterba
2024-10-15  1:26       ` Qu Wenruo
2024-10-15  4:04         ` Anand Jain
2024-10-15  4:41           ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox